Compare commits
9 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 222815d374 | |
| | d9e767d048 | |
| | 313dee503c | |
| | b1d6fafd3a | |
| | cc1bbd0ab7 | |
| | e33e76f2d1 | |
| | bd49c93703 | |
| | 9a02955770 | |
| | 5f8dd688f5 | |
@@ -371,6 +371,13 @@ Concrete forms that Part 1's *Verification Plan* MUST take in this repo:
|
||||
- `kernbench run --device <id>` runs the benchmark on a single device.
|
||||
- Omitting `--device` runs the benchmark on all devices discovered in the topology (logically parallel).
|
||||
- Device enumeration is handled by the CLI only; benchmarks MUST remain single-device.
|
||||
- **Eval-bench exception (ADR-0054)**: a *milestone / eval bench*
|
||||
(`milestone-1h-*`) may drive many configurations and build its own
|
||||
per-config engines to regenerate a domain's full result + figure set; it
|
||||
ignores `--device` and submits a sentinel tensor to satisfy the
|
||||
"must submit ≥1 request" contract (ADR-0045 D4). This is the eval-harness
|
||||
carve-out to the single-device rule, alongside the ADR-0024 multi-SIP CCL
|
||||
exception.
|
||||
|
||||
## Derived Artifacts (Clarification)
|
||||
|
||||
|
||||
@@ -7,6 +7,11 @@ Accepted
|
||||
Documents the `tests/sccl/` evaluation harness; verified against the implementation
(constants, file set, and sweep dimensions cross-checked).

**Amended by ADR-0054**: the driver core, sweep, and renderer moved to the `milestone-1h-ccl`
bench (their single home); `tests/sccl/_allreduce_helpers.py` now re-exports from there
(the pytest-only param builders and the `_run_distributed` wrapper stay local).
The figure tests are unchanged.
|
||||
|
||||
## Context
|
||||
|
||||
ADR-0032 defines the intercube all-reduce *algorithm*, while ADR-0023/0024/0027
|
||||
|
||||
@@ -8,6 +8,12 @@ GEMM 평가/특성화 하니스를 문서화한다; 구현과 대조 검증 완
|
||||
(constants, tile sizes, figure set, and the script↔test split cross-checked). The D5/D6
caveats are recorded limitations, not inaccuracies.

**Amended by ADR-0054**: the sweep and renderer moved to the `milestone-1h-gemm` bench (their
single home); `scripts/gemm_sweep.py` and `tests/gemm/` now re-export from there.
The D1/D2 rule "data generation is a manual script / heavy work is opt-in" is
superseded by the eval-bench pattern (one bench regenerates everything;
`MILESTONE_FAST=1` reuses the committed JSON).
|
||||
|
||||
## Context
|
||||
|
||||
ADR-0014(PE pipeline)와 ADR-0042(tile-plan generator)는 GEMM *구현*을
|
||||
|
||||
@@ -0,0 +1,265 @@
|
||||
# ADR-0045: Bench Module Contract — registration, dispatch, and authoring
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-21).
|
||||
|
||||
This ADR defines, in one place, the registration mechanism of the `src/kernbench/benches/`
package (`@bench`), the CLI dispatch path (`kernbench run/list`), and the contract a new
bench module must follow. ADR-0010 (CLI surface) specifies the `kernbench list/run`
interface, but **how a bench is registered and which function signature it must follow**
was not captured at the ADR level.

**Extended by ADR-0054**: a third pattern is added to D5's single-configuration rule.
An *eval bench* (e.g. `milestone-1h-*`) drives many configurations, builds its own
per-config engines, and submits a sentinel tensor to satisfy D4.
|
||||
|
||||
## First action

When the `kernbench.benches` package is imported, `__init__.py` immediately calls
`_eager_import_and_audit(__path__, __name__)`. The function first enumerates every
sibling module in the package directory with `pkgutil.iter_modules(__path__)`, then
**eagerly loads** with `importlib.import_module(...)` every module that matches
neither of these two exclusions:

- the name is `registry` (the infrastructure itself)
- the name starts with `_` (a helper module)

At import time, the `@bench(name=..., description=...)` decorator in each module runs,
appending a `(name, description, fn)` tuple to the `_PENDING` list and adding
`fn.__module__` to the `_REGISTERED_MODULES` set.

Once all imports finish, `_audit_modules(imported, _REGISTERED_MODULES)` runs; if any
module **was imported but never called `@bench`**, it immediately raises
`RuntimeError("Bench module(s) missing @bench decorator: ...")`.
When the audit passes, no indices have been assigned yet; on the first
`list_all()` / `resolve(...)` call, `_finalize()` assigns 1-based indices in
alphabetical order of name.

In short, **the bench infrastructure's first action is to import every non-helper
module in the package directory and audit that each one called `@bench` at least once**.
|
||||
|
||||
## Context
|
||||
|
||||
`src/kernbench/benches/` currently holds eight bench modules (`ccl_allreduce`,
`gemm_single_pe`, `gpt3_qkv`, `ipcq_allreduce`, `matmul_composite`, `qkv_gemm`,
`qkv_gemm_multi_pe`, `va_offset_verify`). Every bench follows the same unified flow:
|
||||
|
||||
```
kernbench run --topology <T> --bench <N>
↓
cli/main.py::cmd_run
↓ resolve_topology(T) + resolve(N) + resolve_device(device_arg)
↓
runtime_api/bench_runner.py::run_bench(topology, bench_fn, device, engine_factory)
↓ engine_factory(topology, device) → GraphEngine
↓ RuntimeContext(engine, target_device, correlation_id, spec)
↓
bench_fn(ctx) ← calls the run(torch) defined by the bench
↓ submits via ctx.empty/zeros/from_numpy/launch/distributed.*, etc.
↓
ctx.wait_all() ← drains any outstanding handles
↓
BenchResult(completion, correlation_id, trace, traces, engine)
```
|
||||
|
||||
ADR-0010 covers only the CLI surface (`run/list/probe/web`), and ADR-0007 covers only
the responsibility boundary between the runtime API and sim_engine. The actual question,
"what shape must a new bench file take?", can only be answered by tracing code
conventions. As a result:

- The `@bench` decorator's calling convention (kebab-case name, non-empty description)
  exists only in code.
- The bench function signature (`def run(torch)`) is a de facto convention, enforced only
  by the fact that CLI dispatch calls `spec.run`.
- New bench authors learn that helper modules must be split off with a `_` prefix only
  after hitting the audit `RuntimeError`.
- It is unclear how far the single-device convention (CLAUDE.md Part 2 CLI Semantics)
  applies from a bench author's point of view (is a CCL multi-SIP bench an exception?).

This ADR resolves these ambiguities in one place.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. `@bench` decorator contract
|
||||
|
||||
```python
from kernbench.benches.registry import bench

@bench(name="my-bench", description="Short, complete-sentence description.")
def run(torch):
    ...
```
|
||||
|
||||
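- `name`: a kebab-case string. It must match the regex `^[a-z][a-z0-9]*(-[a-z0-9]+)*$`.
  Only lowercase letters, digits, and dashes are allowed; underscores (`_`) are forbidden;
  it must start with a letter.
- `description`: a non-empty string (length > 0 after strip). It is shown verbatim in the
  CLI `list` output.
- The decorator **returns fn unmodified**, so the function can still be called directly.
  Its only side effect is adding the registration to `_PENDING`.

Violating either of the first two rules raises `ValueError` immediately. A duplicate name
is caught at `_finalize()` time with `RuntimeError("duplicate bench name: ...")`.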
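The two argument rules can be checked with a short validator. The regex is the one quoted above; the function name `validate_bench_args` and the error messages are assumptions made for this sketch, not the registry's actual wording.

```python
import re

# Kebab-case pattern quoted in D1.
_NAME_RE = re.compile(r"[a-z][a-z0-9]*(-[a-z0-9]+)*")


def validate_bench_args(name: str, description: str) -> None:
    """Raise ValueError when a name or description breaks the D1 rules."""
    # fullmatch anchors both ends, so a trailing newline cannot slip through.
    if not _NAME_RE.fullmatch(name):
        raise ValueError(f"bench name must be kebab-case: {name!r}")
    if not description.strip():
        raise ValueError("bench description must be non-empty")
```

Names such as `gemm-single-pe` and `milestone-1h-ccl` pass, while `my_bench`, `1bench`, and `a--b` are rejected.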
|
||||
|
||||
### D2. Module file convention

`src/kernbench/benches/<slug>.py` must be one of the following:

- **bench module**: `@bench(...)` runs at least once on the top-level import path,
  registering at least one bench.
- **helper module**: the file name starts with `_` (e.g. `_shared_helpers.py`). It is
  skipped during the `iter_modules` walk.

The audit (`_audit_modules`) rejects any module that is not a helper yet never calls
`@bench`. The intended result: simply adding a new file to `benches/` registers it
automatically, and the bench/helper distinction is marked unambiguously by the
**file-name prefix** alone.
|
||||
|
||||
### D3. The bench function signature is `def run(torch)`

The decorator does not enforce a function name, but **CLI dispatch calls `spec_entry.run`
(i.e. the decorated callable)**. The convention is therefore:

- Function name: `run`. Decorating a function with another name still works, but always
  use `run` for readability and grep-ability.
- Argument: a single positional argument `torch`. It is actually a `RuntimeContext`
  instance that exposes a PyTorch-style namespace (zeros/empty/launch/distributed/...) (ADR-0024 D3).
- Return value: arbitrary (`Any`). `run_bench` currently ignores the return value and
  tracks completion via `ctx.handles()` / `engine.get_completion()`.

The name `torch` mimics the PyTorch-compatible idiom; no actual PyTorch module is
passed in (aligned with the "rank = SIP" launcher convention of ADR-0024).
|
||||
|
||||
### D4. A bench must perform at least one submit

When `ctx.handles()` is empty, `run_bench` returns BenchResult.completion as
`ok=False, error_code="NO_REQUESTS"`. A meaningful bench must therefore call at least
one of the following:

- Tensor creation APIs: `torch.zeros(...)`, `torch.empty(...)`. Internally these submit
  `MmuMapMsg` and (for zeros) `MemoryWriteMsg`.
- Kernel launch API: `torch.launch(name, fn, *args)`. It submits a `KernelLaunchMsg`
  per SIP.
- (Exception) Empty placeholder bench: a stub such as `ipcq_allreduce.py` that only calls
  `print(...)` receives a NO_REQUESTS result. CI is assumed to recognize it as a
  placeholder and handle it separately.
|
||||
|
||||
### D5. Single-device convention + multi-SIP exception (ADR-0024/0027)

The **"benchmarks MUST remain single-device"** convention stated in CLAUDE.md Part 2
CLI Semantics is interpreted as follows:

- **Regular bench (single SIP)**: defines tensor placement with `dp = DPPolicy(...)` and
  launches kernels with `torch.launch(...)`. The SIP index is decided by `--device`
  (a CLI responsibility).
- **CCL bench (multi-SIP)**: as an exception, follows the rank = SIP pattern (ADR-0024 D3)
  using `torch.distributed.init_process_group(backend="ahbm")` +
  `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)`. `--device` is ignored (or
  assumed to be `all`), and each spawned worker binds its own SIP with
  `torch.ahbm.set_device(rank)`.

Any multi-device usage outside these two patterns (e.g. one bench function launching
directly on several SIPs in the same process) is forbidden by this ADR. Even when the CLI
is invoked with `--device all`, the bench runs only once; to handle multiple SIPs inside
it, use D5's second pattern.
|
||||
|
||||
### D6. Name and index resolution (`resolve`)

`resolve(identifier: str)` returns a BenchSpec using the following order:

1. `identifier.isdigit()` → convert to an integer and return the spec among the
   `_REGISTRY` entries whose `index ==` that value. If none exists, raise
   `ValueError("No bench with index ...")`.
2. `identifier in _REGISTRY` → direct lookup.
3. Otherwise → `ValueError("Unknown bench ...")`.

An empty or whitespace-only identifier raises
`ValueError("bench identifier must be a non-empty string.")`.

The CLI passes the `--bench` argument to `resolve` unchanged, so users can write either
`kernbench run --bench gemm-single-pe` or `kernbench run --bench 2`.
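The lookup order, together with the alphabetical 1-based indexing that `_finalize()` performs, can be illustrated as follows. This is a sketch under stated assumptions: the `BenchSpec` fields and the explicit `registry` parameter are hypothetical (the real module keeps module-level state), and only the error messages quoted in the text are taken from the source.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BenchSpec:
    # Field layout is an assumption for this sketch.
    index: int
    name: str
    description: str
    run: Callable


def finalize(pending):
    """Sort pending entries by name and assign 1-based indices."""
    registry = {}
    ordered = sorted(pending, key=lambda entry: entry[0])
    for index, (name, description, fn) in enumerate(ordered, start=1):
        if name in registry:
            raise RuntimeError(f"duplicate bench name: {name}")
        registry[name] = BenchSpec(index, name, description, fn)
    return registry


def resolve(registry, identifier: str) -> BenchSpec:
    """Resolve by index first, then by name."""
    if not identifier or not identifier.strip():
        raise ValueError("bench identifier must be a non-empty string.")
    if identifier.isdigit():
        wanted = int(identifier)
        for spec in registry.values():
            if spec.index == wanted:
                return spec
        raise ValueError(f"No bench with index {wanted}")
    if identifier in registry:
        return registry[identifier]
    raise ValueError(f"Unknown bench {identifier!r}")
```

Because indices follow alphabetical order, registering one more bench can renumber the existing ones, which is why automation should pass names rather than indices.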
|
||||
|
||||
### D7. Indices are not a stable API

Because `_finalize()` sorts `_PENDING` **alphabetically by name** before assigning 1-based
indices, adding a new bench can shift the indices of existing benches. Therefore:

- Human-friendly interactive use: indices are fine.
- Scripts / CI automation: always use names.

This is stated in the `registry.py` module docstring.
|
||||
|
||||
### D8. Surface that RuntimeContext exposes to benches

The surface a bench function may legitimately use through the `torch` parameter:

- **Tensor creation**: `torch.empty(shape, dtype=..., dp=DPPolicy(...), name=...)`,
  `torch.zeros(...)`, `torch.from_numpy(arr)`. Each submits host-side metadata plus
  device deployment (MmuMap + MemoryWrite).
- **Kernel launch**: `torch.launch(kernel_name, kernel_fn, *args)`. It converts
  `(Tensor, int, float)` positional arguments into `TensorArg` / `ScalarArg`, issues a
  `KernelLaunchMsg` per SIP, then drains.
- **Synchronization**: `torch.wait(handle)`, `torch.wait_all()` (called automatically by run_bench).
- **Distributed**: `torch.distributed.init_process_group(backend="ahbm")`,
  `torch.distributed.get_world_size()`, `torch.distributed.all_reduce(t, op=...)`
  (ADR-0024/0027).
- **Multi-process (rank=SIP)**: `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)`
  (ADR-0024 D3 / ADR-0027).
- **Device binding**: `torch.ahbm.set_device(rank)` or
  `torch.accelerator.set_device_index(rank)` (both point at the same namespace).
- **IPCQ install**: `torch.install_ipcq(algorithm=..., ccl_yaml=...)` (ADR-0023 D10).
- **Spec lookup**: `torch.spec`, the dict built by the topology builder (system,
  cube_mesh, HBM parameters, etc.). It exists so that a bench can be written without
  depending on topology.yaml parameters.

A bench must not directly access private RuntimeContext members not listed above
(`_handles`, `_traces`, `_allocators`, etc.). This follows the layer-boundary spirit of
ADR-0007: only the one-way direction bench → runtime API → sim_engine is allowed.
|
||||
|
||||
### D9. Parameterization through environment variables is allowed

Injecting bench parameters from outside, as `matmul_composite.py` does with
`os.environ.get("MATMUL_M", ...)`, is an allowed pattern. Reasons:

- D3 fixes the bench function signature to `def run(torch)`, so parameters cannot
  easily be passed as positional or keyword arguments.
- The environment-variable pattern is a natural hook for operational sweeps such as
  `MATMUL_VARIANT`.
- External drivers such as `scripts/gemm_sweep.py` (ADR-0044) use this hook.

However, if environment variables change a bench's behavior, every variable must be
listed in the module docstring (matmul_composite.py is the example).
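As a rough illustration of the pattern, a bench might read its sweep parameters like this. The variable names `MATMUL_M` and `MATMUL_VARIANT` appear in the text; the default values, the helper name `matmul_params`, and the returned dict layout are placeholders invented for this sketch and do not reflect what `matmul_composite.py` actually does.

```python
"""Hypothetical bench parameters.

Environment variables (all optional):
    MATMUL_M        M dimension, integer (placeholder default: 128)
    MATMUL_VARIANT  variant label, string (placeholder default: "baseline")
"""
import os
from typing import Mapping


def matmul_params(environ: Mapping[str, str] = os.environ) -> dict:
    """Read sweep parameters from the environment with explicit defaults."""
    return {
        "m": int(environ.get("MATMUL_M", "128")),
        "variant": environ.get("MATMUL_VARIANT", "baseline"),
    }
```

Accepting the mapping as a parameter keeps the function testable without mutating the process environment.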
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Keep the bench list in an explicit manifest file (YAML)

Rejected. The `@bench` decorator + audit pattern guarantees "adding a file = automatic
registration", concentrating the new bench author's cognitive cost in one place (writing
the file). A separate manifest carries a high drift risk on the maintenance side, and
helper separation is already clear from the `_` prefix.

### A2. Take the bench function name as a decorator argument (`@bench(name=..., entry="run_xxx")`)

Rejected. It breaks the simplicity of dispatch calling only `spec.run`. The single `run`
convention suffices; if variants are needed, register several functions in the same
module (each decorated with `@bench`).

### A3. A separate `@multi_device_bench` decorator for CCL benches

Rejected. The two patterns stated in D5 (single + ADR-0024 multi-SIP) can express all
eight current benches. A separate decorator would force a branch on the dispatch side and
add complexity, and the intent to use multiple SIPs is already evident from the
`init_process_group(...)` call in the bench body.

### A4. Make indices a stable API (registration order / explicit `index=` argument)

Rejected. This is the trade-off stated in D7: user-friendliness comes first
(alphabetically sorted indices read naturally as 1, 2, 3... in the `list` output).
Scripts can simply use names.
|
||||
|
||||
## Consequences

- "How to add a bench" is consolidated in one ADR, so a new author only needs to follow
  D1-D3 and D8 without grepping the code.
- Splitting helper modules off with a `_` prefix is justified at the ADR level, which
  frees future authors to write shared helpers of the form `benches/_*.py`.
- D5 makes explicit that the single-device convention of CLAUDE.md Part 2 CLI Semantics
  does not contradict multi-SIP CCL benches; the two are orthogonal.
- The rationale (D9) for `scripts/gemm_sweep.py` in ADR-0044 (GEMM eval harness) using
  the environment-variable hook is fixed in this ADR.
- Index instability (D7) is now explicit, so CI code such as `kernbench run --bench 3`
  should be reviewed right after this ADR is accepted.
|
||||
@@ -0,0 +1,307 @@
|
||||
# ADR-0046: TLContext — Kernel-side `tl.*` API Contract
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-22).
|
||||
|
||||
This ADR specifies the set of `tl.*` primitives exposed by `TLContext` in
`src/kernbench/triton_emu/`, their semantics, and the contract of the two execution
modes (command-list / greenlet runner). ADR-0014/0020 define the PE pipeline and the
2-pass execution model, but **the `tl.*` surface that a bench's kernel function calls**
was not documented at the ADR level.
|
||||
|
||||
## First action

On construction, `TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id,
num_cubes, scratch_base, scratch_size)` first initializes the following six groups of fields:

- `self._pe_id`, `self._num_programs`, `self._cube_id`, `self._num_cubes`: the values
  that `tl.program_id` / `tl.num_programs` will return.
- `self._dispatch_cycles`: the cycle count of the `PeCpuOverheadCmd(cycles)` issued
  automatically at the start of every `tl.*` API call.
- `self._runner`: the `KernelRunner` instance (greenlet mode if present, command-list
  mode if absent).
- `self._commands: list[PeCommand] = []`: the command sequence accumulated in
  command-list mode.
- `self._handle_counter = 0`, `self._completion_counter = 0`: used to generate new
  TensorHandle / CompletionHandle ids.
- `self._scratch_base`, `self._scratch_size`, `self._scratch_cursor = 0`: the PE-local
  scratch region (used to allocate addresses for math/dot/composite output handles).

In short, **TLContext's first action is to fill in the metadata describing where this
kernel instance runs (sip/cube/pe), at what scale (num_programs/num_cubes), and in which
mode (runner present or not) it will issue commands**. At this point no SimPy event
exists and no command has been issued.

The first runtime action happens when the kernel function first calls `tl.<api>()`.
The standard entry behavior of every `tl.*` API is:

1. Call `self._emit_dispatch_overhead()`: when `dispatch_cycles > 0`, immediately
   `_emit` a `PeCpuOverheadCmd(dispatch_cycles)`.
2. API-specific processing (create the TensorHandle, build the command).
3. `self._emit(cmd)`: in runner mode, hand the cmd to the SimPy side via
   greenlet.switch; otherwise append it to `self._commands`.
|
||||
|
||||
## Context
|
||||
|
||||
The `tl.*` surface consists of the methods exposed by `TLContext`, and the `tl`
parameter a kernel function receives is this object. From the user's (bench author's)
point of view, the contract covers:

- which primitives exist
- what data flow each primitive produces (DMA / compute / IPCQ / metadata-only)
- how a TensorHandle's `space` and `addr` are determined
- how command-list mode and greenlet mode differ

ADR-0014 (PE pipeline) defines the PeCommands that PE_SCHEDULER receives, but how `tl.*`
emits them exists only as a code convention. Likewise, ADR-0020 (2-pass data execution)
mentions the existence of greenlet mode in D3, but the signature difference between the
runner and non-runner paths (return-value handling) is not stated at the ADR level.
This ADR fills that gap.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. The `tl` parameter is a `TLContext` instance

A bench's kernel function follows this signature:
|
||||
|
||||
```python
def _kernel(arg1, arg2, ..., tl, **kwargs):
    ...
```
|
||||
|
||||
`tl` is a `kernbench.triton_emu.tl_context.TLContext` instance. The name mimics real
Triton's `triton.language` module; no real Triton module is passed in.

A kernel function is a plain Python function with no `yield` / `async`. `tl.*` calls
generate SimPy events, but to the caller (the kernel) they look like synchronous calls,
because in greenlet mode KernelRunner relays between SimPy and the kernel (ADR-0020 D3).
|
||||
|
||||
### D2. Two execution modes: command-list / greenlet runner

- **command-list mode (`runner is None`)**: `tl.*` calls accumulate PeCommands in the
  `self._commands` list. DMA / GEMM / Math consume no actual SimPy time, and the return
  value is a metadata-only TensorHandle (data=None). PE_SCHEDULER / sim_engine later
  replays the command sequence in time.

- **greenlet runner mode (`runner is not None`)**: `tl.*` calls go through
  `self._emit(cmd)`, which hands control to the parent greenlet (SimPy) via
  `runner.switch_to_simpy(cmd)`. The parent dispatches the cmd to components, consumes
  SimPy time, and, for a DMA read, returns the actual numpy data. The kernel receives
  that result and proceeds to the next line (the data-aware execution model of ADR-0020 D3).

The mode is selected by whether a KernelRunner instance is injected into the TLContext;
the `tl.*` methods are unaware of the difference and behave uniformly through the
`_emit()` helper.
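The way a single `_emit()` helper hides the mode difference can be sketched without greenlets. Everything below is a simplified stand-in: `MiniTL` and `FakeRunner` are hypothetical classes, the command dataclass fields are assumptions, and the runner answers synchronously instead of switching greenlets. Only the method names `_emit`, `_emit_dispatch_overhead`, and `switch_to_simpy` come from the text.

```python
from dataclasses import dataclass


@dataclass
class PeCpuOverheadCmd:
    cycles: int


@dataclass
class DmaReadCmd:
    addr: int      # field names are assumptions for this sketch
    nbytes: int


class FakeRunner:
    """Stand-in for KernelRunner: answers each command synchronously."""

    def __init__(self):
        self.seen = []

    def switch_to_simpy(self, cmd):
        self.seen.append(cmd)
        # Pretend a DMA read returns zero-filled bytes; other commands return None.
        return bytes(cmd.nbytes) if isinstance(cmd, DmaReadCmd) else None


class MiniTL:
    def __init__(self, dispatch_cycles=1, runner=None):
        self._dispatch_cycles = dispatch_cycles
        self._runner = runner
        self._commands = []

    def _emit(self, cmd):
        if self._runner is not None:       # runner mode: hand off, get a result
            return self._runner.switch_to_simpy(cmd)
        self._commands.append(cmd)         # command-list mode: just record
        return None

    def _emit_dispatch_overhead(self):
        if self._dispatch_cycles > 0:
            self._emit(PeCpuOverheadCmd(self._dispatch_cycles))

    def load(self, addr, nbytes):
        self._emit_dispatch_overhead()
        return self._emit(DmaReadCmd(addr, nbytes))
```

In command-list mode `load` returns no data and leaves two recorded commands behind; with a runner injected and `dispatch_cycles=0`, the same call returns data and records nothing locally.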
|
||||
|
||||
### D3. Primitive categories

#### D3.1. Reference (no DMA, metadata only)

- `tl.ref(ptr, shape, dtype="f16") -> TensorHandle`: creates only a handle that
  references HBM data and issues no DMA. Used when the composite scheduler streams per
  tile (e.g. the b operand of a GEMM).

#### D3.2. Data movement (blocking, DMA engine)

- `tl.load(ptr, shape, dtype="f16") -> TensorHandle`: HBM → result handle. Issues
  `DmaReadCmd`. In greenlet mode the actual numpy array is attached to the result
  handle's `.data`; in command-list mode it is a placeholder. The returned handle has
  `space="hbm"`, `pinned=True`.
- `tl.store(ptr, handle) -> None`: TCM → HBM. Issues `DmaWriteCmd`. In greenlet mode,
  `_store.write("hbm", ptr, data)` is called first, and only when `handle.data` is
  present (visibility = issue time, ADR-0020 D3).

#### D3.3. GEMM / compute (blocking)

- `tl.dot(a, b) -> TensorHandle`: `a @ b`. Both operands must be in TCM, with shapes
  (M,K) × (K,N) → (M,N). Issues `GemmCmd`; the output handle is allocated in PE-local
  scratch via `_make_compute_out(shape, dtype)`.
- `tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None,
  acc_dtype=None, tile_shape=None) -> CompletionHandle`: a non-blocking tiled pipeline.
  Issues `CompositeCmd`. `epilogue` is a list of dicts; each dict has an `"op"` key,
  op-specific fields, and an optional `"scope"` (k_tile / output_tile); an unknown op
  or a missing field raises ValueError immediately. Synchronize on the returned
  CompletionHandle with `tl.wait(h)`.
|
||||
|
||||
#### D3.4. Math: unary (blocking)

- `tl.exp(x)`, `tl.log(x)`, `tl.sqrt(x)`, `tl.abs(x)`, `tl.sigmoid(x)`,
  `tl.cos(x)`, `tl.sin(x)`: each issues `MathCmd(op=<name>, inputs=(x,), out=)`.
  `out` is a scratch allocation with the same shape/dtype.

#### D3.5. Math: binary (blocking)

- `tl.maximum(a, b)`, `tl.minimum(a, b)`: `_binary_math`.
- `tl.fma(a, b, c)`: `a*b + c`, with three inputs.
- `tl.clamp(x, min, max)`: `MathCmd(op="clamp", inputs=(x, min, max))`.
- `tl.where(cond, a, b)`: `MathCmd(op="where", inputs=(cond, a, b))`.
- `tl.softmax(x, axis=-1)`: a single MathCmd(op="softmax"), so timing is accounted for
  in one step. The Phase 2 DataExecutor expands it into the canonical sequence
  (x-max → exp → sum → div).

#### D3.6. Reduction (blocking)

- `tl.sum(x, axis)`, `tl.max(x, axis)`, `tl.min(x, axis)`: return an output handle with
  the size of the given axis reduced to 1. Each issues
  `MathCmd(op=<name>, inputs=(x,), out=, axis=axis)`.

#### D3.7. Index / scalar (PE_CPU, no engine)

- `tl.program_id(axis=0) -> int`: `axis==0` → pe_id (cube-local PE index),
  `axis==1` → cube_id (ADR-0022).
- `tl.num_programs(axis=0) -> int`: `axis==0` → num_programs (PEs per cube),
  `axis==1` → num_cubes.
- `tl.arange(start, end, dtype="i32") -> TensorHandle`: an index range in TCM.
  Metadata only; no command is issued.
- `tl.zeros(shape, dtype="f16") -> TensorHandle`, `tl.full(shape, value,
  dtype="f16") -> TensorHandle`: a placeholder in TCM. No command is issued.

#### D3.8. Scalar helpers (no command, no engine)

- `TLContext.cdiv(a, b) -> int` (static): ceiling division `-(-a // b)`.
  Mirrors real Triton's `tl.cdiv`.
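The `-(-a // b)` idiom works because Python's floor division rounds toward negative infinity, so negating before and after turns a floor into a ceiling. A standalone version, written as a free function here purely for illustration:

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for integers using floor division on the negation."""
    return -(-a // b)


# Typical use: how many tiles of width 64 are needed to cover 1000 columns.
num_tiles = cdiv(1000, 64)
```

With 1000 columns and a tile width of 64, fifteen tiles cover 960 columns and a sixteenth covers the remainder, so `num_tiles` is 16.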
|
||||
|
||||
#### D3.9. Metadata-only (no compute, no DMA)

- `tl.trans(x) -> TensorHandle`: a new handle with the last two dims of the shape
  swapped. It shares the same addr/data; no command is issued.

#### D3.10. IPCQ (CCL) primitives (ADR-0023 D4)

- `tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None,
  dtype="f16", space="tcm") -> None`: blocking send. Both the handle form and the raw
  address form are accepted. Issues `IpcqSendCmd`. When a snapshot of the handle's
  `.data` is carried in the command, it avoids a race on the data the recv side receives.
- `tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None,
  dst_space=None) -> TensorHandle`: blocking recv. If both `dst_addr` and `dst_space`
  are given it runs in "copy_to_dst" mode, otherwise in "return_slot" mode. In greenlet
  mode the actual data is attached to the handle's `.data`.
- `tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle`:
  **DIAGNOSTIC ONLY**. The blocking recv synchronization still applies, but the
  slot-read latency (slot-IO + PE↔bank fabric drain) is skipped. It exists for an
  apples-to-apples comparison against `tl.store` in the pe2pe overview plot. Production
  kernels must not use it; it is isolated in a separate `consume=False` command branch,
  so it cannot be enabled by an accidental flag.
- `tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture`: non-blocking recv.
  Returns a `RecvFuture`; the result is collected later with `tl.wait(future)`.

#### D3.11. Composite + control

- `tl.composite(...)`: described in D3.3.
- `tl.wait(handle=None)`: waits on a `CompletionHandle` (composite), a `RecvFuture`
  (async recv), or `None` (all pending composites).
- `tl.cycles(n)`: explicitly declares PE_CPU scalar execution overhead. Issues
  `PeCpuOverheadCmd(cycles=n)`.
|
||||
|
||||
### D4. TensorHandle arithmetic operators: thread-local TLContext

When the `tl_context.py` module loads, `_enable_tensor_ops()` is called and patches
`TensorHandle.__add__`, `__sub__`, `__mul__`, and `__truediv__`. Each operator calls
`_binary_math` on the active TLContext stored in the thread-local `_ctx` (a module
variable).

Inside a kernel, `c = a + b` is therefore equivalent to issuing
`MathCmd(op="add", inputs=(a,b), out=)` and returning a new TensorHandle.
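A compact way to picture the operator dispatch is shown below. It is only a sketch: `Handle` and `MiniCtx` are hypothetical stand-ins for `TensorHandle` and `TLContext`, the operators are defined directly on the class rather than patched in by `_enable_tensor_ops()`, and commands are recorded as plain tuples.

```python
import threading
from dataclasses import dataclass

_local = threading.local()  # holds the active context for the current thread


def set_active(ctx) -> None:
    _local.ctx = ctx


def get_active():
    ctx = getattr(_local, "ctx", None)
    if ctx is None:
        raise RuntimeError("no active context")
    return ctx


@dataclass
class Handle:
    shape: tuple

    def __add__(self, other):
        return get_active()._binary_math("add", self, other)

    def __mul__(self, other):
        return get_active()._binary_math("mul", self, other)


class MiniCtx:
    def __init__(self):
        self.commands = []

    def _binary_math(self, op, a, b):
        out = Handle(a.shape)                 # same-shape output handle
        self.commands.append((op, a, b, out))
        return out


def run_kernel(kernel_fn, ctx, *args):
    """Activate ctx for the duration of the kernel, then clear it."""
    set_active(ctx)
    try:
        return kernel_fn(*args)
    finally:
        set_active(None)
```

Using an operator outside `run_kernel` raises, which makes a missing activation visible instead of silently recording into the wrong context.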
|
||||
|
||||
Managing the active TLContext:

- `TLContext._set_active(ctx)`: sets the active ctx for the current thread/greenlet.
- `TLContext._get_active()`: looks it up (RuntimeError if absent).
- `run_kernel(kernel_fn, tl_ctx, *args, **kwargs)`: a helper. It sets the active ctx on
  entry, runs the kernel, and restores it to None on exit.

On every cmd dispatch, `KernelRunner`'s `_switch_kernel` calls `_set_active(tl)`
directly, so the active ctx is restored even if another PE runner in the same thread
has overwritten it.
|
||||
|
||||
### D5. Scratch allocator — compute output handles
|
||||
|
||||
Ops that produce a result, such as `tl.dot`, `tl.exp`, and `tl.add` (TensorHandle
`__add__`), call `_make_compute_out(shape, dtype)` to allocate a 16-byte aligned scratch
address. The address is issued with `space="tcm"`, and `tl.send` / `tl.store` can later
use the handle as a source.

If `_scratch_base == 0` (command-list mode, for example), the allocated address is
returned as 0 and the handle cannot be used as a send/store source (in that case only
handles obtained from `tl.load` can be a source).

If the cursor exceeds `_scratch_size` (default 1 MiB), a RuntimeError is raised.
The cursor must be reset to 0 at the start of each kernel invocation (in the current
code this happens naturally because KernelRunner creates a fresh TLContext each time).
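The allocation rule reads like a classic bump allocator. The following sketch captures the three behaviors named above (16-byte alignment, address 0 when no base is configured, RuntimeError on exhaustion); the class name `ScratchAllocator`, its method names, and the assumption that the base itself is 16-byte aligned are illustrative choices, not the actual `_make_compute_out` code.

```python
class ScratchAllocator:
    ALIGN = 16

    def __init__(self, base: int, size: int = 1 << 20):  # 1 MiB default
        self.base = base
        self.size = size
        self.cursor = 0

    def alloc(self, nbytes: int) -> int:
        """Return an aligned absolute address, or 0 if no scratch is configured."""
        if self.base == 0:
            return 0
        # Round the cursor up to the next multiple of ALIGN.
        start = (self.cursor + self.ALIGN - 1) // self.ALIGN * self.ALIGN
        if start + nbytes > self.size:
            raise RuntimeError("scratch region exhausted")
        self.cursor = start + nbytes
        return self.base + start

    def reset(self) -> None:
        """Rewind for the next kernel invocation."""
        self.cursor = 0
```

Creating a new allocator per invocation, as the text says KernelRunner effectively does with TLContext, makes the explicit `reset` unnecessary.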
|
||||
|
||||
### D6. Dispatch overhead — `PeCpuOverheadCmd(dispatch_cycles)`
|
||||
|
||||
At the entry of every non-metadata `tl.*` call, `_emit_dispatch_overhead()` is invoked
and issues `PeCpuOverheadCmd(dispatch_cycles)` when `dispatch_cycles > 0`. This models
the cycle cost that PE_CPU spends on command dispatch itself.

Defaults:

- Default of the `dispatch_cycles` parameter in `TLContext.__init__`: 1 cycle.
- TLContext created by `KernelRunner`: 0 cycles (greenlet mode accounts for cycles
  separately, in the spirit of ADR-0020 D3).
|
||||
|
||||
### D7. Kernel registry (`triton_emu/registry.py`)
|
||||
|
||||
A separate `_kernels: dict[str, Callable]` holds the mapping from kernel name to function:

- `register_kernel(name, fn)`: raises ValueError on duplicate registration.
- `get_kernel(name)`: raises KeyError if not registered.
- `clear_registry()`: for tests only.

On every call, `RuntimeContext.launch(kernel_name, kernel_fn, *args)` idempotently
overwrites `_kernels[kernel_name] = kernel_fn` (last call wins). This is consistent with
the launch behavior in ADR-0045 D8.
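The contrast between strict registration and the overwrite performed by `launch` can be seen in a minimal registry. The three function names and their error types follow the list above; `overwrite_kernel` is a hypothetical name for the assignment that `launch` is described as doing inline.

```python
from typing import Callable

_kernels: dict[str, Callable] = {}


def register_kernel(name: str, fn: Callable) -> None:
    """Strict registration: refuse to replace an existing kernel."""
    if name in _kernels:
        raise ValueError(f"kernel already registered: {name}")
    _kernels[name] = fn


def get_kernel(name: str) -> Callable:
    """Lookup; an unregistered name surfaces as KeyError."""
    return _kernels[name]


def clear_registry() -> None:
    """Test-only reset."""
    _kernels.clear()


def overwrite_kernel(name: str, fn: Callable) -> None:
    """What launch is described as doing: last call wins."""
    _kernels[name] = fn
```

Keeping both paths makes the policy explicit: direct registration protects against accidental name clashes, while repeated launches of the same name stay idempotent.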
|
||||
|
||||
PE_CPU looks up the kernel function in the registry by `KernelRef.name` and then runs it
with KernelRunner.
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Fold `tl.*` into ADR-0014 / ADR-0020

Rejected. ADR-0014 covers the PE pipeline (sim_engine-side consumption of PeCommands) and
ADR-0020 covers 2-pass execution (Phase 1 timing / Phase 2 data). `tl.*` is the API
surface kernel authors meet, so keeping it separate is better for searchability and
onboarding.

### A2. Deprecate command-list mode

Rejected (for now). Command-list mode is lightweight for simple unit tests and kernel
verification. It is kept as an entry point for inspecting the PeCommand sequence without
a greenlet dependency. When the semantics unique to greenlet mode (real data, Phase 2)
are needed, they are entered explicitly through the mode selection in D2.

### A3. Remove TensorHandle arithmetic operators

Rejected. They exist to mimic the readability of real Triton kernel code (e.g.
`c = a + b`), and the thread-local active ctx pattern works cleanly. An explicit
`tl.add(a, b)` is also exposed in D3.5, so a functional call can be used instead if the
operators are confusing.

### A4. Expand softmax into an explicit sequence (max → exp → sum → div)

Partially adopted. `tl.softmax` handles timing in one step as a single
`MathCmd(op="softmax")` (D3.5). The real-data expansion is performed by the Phase 2
DataExecutor using the canonical sequence. In other words, the timing model is atomic
and the data model is expanded; the two concerns are deliberately separated.
|
||||
|
||||
## Consequences
|
||||
|
||||
- Every `tl.*` primitive a bench author meets is classified and defined in one ADR.
  Paired with the host-side surface of ADR-0045 D8 (torch.empty etc.), this completes the
  authoring guide for both "inside the kernel" and "outside the kernel".
- The difference between command-list and greenlet modes is stated in D2, so a new
  `tl.*` primitive is automatically compatible with both as long as it follows the
  `_emit()` pattern.
- The thread-local active ctx pattern (D4) is justified at the ADR level, making clear
  where the reset responsibility lies for future multi-PE same-thread execution
  (`_switch_kernel` restores the active ctx on cmd dispatch, as part of the
  KernelRunner.run contract).
- The diagnostic-only isolation of `tl.recv_no_consume` (D3.10) is fixed in the ADR, so
  the layer that prevents accidental use in production kernels is clear.
- The registry (D7) is split out as its own D item, specifying the behavior for kernel
  name collisions and dynamic re-registration.
|
||||
@@ -0,0 +1,243 @@
|
||||
# ADR-0047: AHBM CCL Backend — `torch.distributed`-compat shim
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-22).
|
||||
|
||||
This ADR specifies `AhbmCCLBackend` + `DistributedContext` in
`runtime_api/distributed.py`: what the
`torch.distributed.init_process_group(backend="ahbm")` entry point actually installs and
with what semantics it implements `all_reduce`/`barrier`/`get_rank` and related calls.
ADR-0023 D11 mentions the intent of "torch.distributed compatibility", but **the
behavioral model of the backend itself** was missing at the ADR level.
|
||||
|
||||
## First action

`RuntimeContext.__post_init__` automatically creates a `DistributedContext()` instance
and attaches it to `self.distributed`. The first actions at that point are:

1. Initialize `self._backend: AhbmCCLBackend | None = None` (not yet initialized).
2. Initialize the greenlet-local rank registry with `self._rank_by_greenlet: dict = {}`
   (ADR-0024 D2).
3. The caller (RuntimeContext) plants a back-reference with `dc._ctx_ref = self`, so
   that `init_process_group` can later reach `ctx.engine` / `ctx.spec` / `ctx.launch`.

In short, **DistributedContext's first action is to attach itself to the RuntimeContext
together with a back-reference and leave the backend slot empty**. The actual backend
installation (IPCQ install, world_size resolution, algorithm module load) happens only
when user code calls `torch.distributed.init_process_group(backend="ahbm")`.

At that point, the first actions of `init_process_group` are:

1. If `backend != "ahbm"`, raise `ValueError("Unsupported backend ...")` immediately.
2. If `getattr(self, "_ctx_ref", None)` is None, raise
   `RuntimeError("DistributedContext not bound to a RuntimeContext")`.
3. `self._backend = AhbmCCLBackend(torch_ctx=ctx)`: inside this constructor the
   ccl.yaml load, algorithm module import, world_size resolution, SFR configuration, and
   IPCQ install all take place.
4. `self._backend._dist_ctx = self`: lets the backend reach back into
   `_rank_by_greenlet`.
|
||||
|
||||
## Context
|
||||
|
||||
The purpose of `AhbmCCLBackend` is to let bench code use PyTorch DDP collective calls
(`init_process_group`, `all_reduce`, etc.) as they are, so that a bench looks the same as
a "real DDP training script" (aligned with the launcher model of ADR-0024 + ADR-0027).

This backend is responsible for:

- **Installing the IPCQ neighbor table once** at `init_process_group` time (similar to
  real NCCL communicator creation).
- On `all_reduce(tensor, op="sum")`, **launching the configured algorithm's kernel
  function via `ctx.launch(...)`**.
- Answering `get_world_size` / `get_rank` consistently from the greenlet-local rank
  registry and ccl.yaml/topology.

ADR-0023 D10 (IPCQ install plan) and ADR-0024 (SIP launcher) cover this in part, but
**the responsibility scope and decision order of `AhbmCCLBackend` itself** is not stated
anywhere. This ADR fills that gap.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. The backend is created only at `init_process_group(backend="ahbm")` time

`DistributedContext` starts with `_backend = None` at `__init__` time. The backend
object does not exist until the user calls `dist.init_process_group(backend="ahbm")`;
if any other API (`is_initialized`, `get_world_size`, `all_reduce`, `barrier`) is called
while the backend is None, it raises
`RuntimeError("Default process group has not been initialized...")`
(the `_ensure_initialized` helper).

`backend != "ahbm"` raises `ValueError` immediately. Other backend names (nccl, gloo,
etc.) are not recognized.
|
||||
|
||||
### D2. world_size resolution priority: algorithm > defaults > topology

Decision order of `AhbmCCLBackend._resolve_world_size` (ADR-0024 D1):

1. If the algorithm entry in `ccl.yaml` has `world_size`, use that value.
2. If `defaults.world_size` exists, use that value.
3. If neither exists, use `spec.system.sips.count` (the number of SIPs in the topology).

The default meaning is **rank = SIP** (ADR-0024). Cube/PE-level parallelism is expressed
inside each rank with DPPolicy and does not affect world_size. An explicit world_size
override in `ccl.yaml` is honored as is, for the legacy "rank = flat PE index" test path.

User arguments to `init_process_group(world_size=..., rank=...)` are **accepted but
ignored** (the same meaning as real PyTorch's `RANK` / `WORLD_SIZE` env vars).
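The three-level priority reduces to a short fallback chain. In this sketch the function is a free function taking plain dicts; the parameter names `algo_cfg`, `defaults`, and `n_sips` are assumptions standing in for the merged `ccl.yaml` sections and `spec.system.sips.count`.

```python
from typing import Mapping


def resolve_world_size(
    algo_cfg: Mapping[str, object],
    defaults: Mapping[str, object],
    n_sips: int,
) -> int:
    """algorithm entry > defaults > number of SIPs in the topology."""
    if "world_size" in algo_cfg:
        return int(algo_cfg["world_size"])
    if "world_size" in defaults:
        return int(defaults["world_size"])
    return n_sips
```

With neither override present the result equals the SIP count, which is the "rank = SIP" default described above.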
|
||||
|
||||
### D3. Installation work that `init_process_group` performs immediately

The following runs sequentially inside `AhbmCCLBackend.__init__`:

1. **Load ccl.yaml**: `kernbench.ccl.install.load_ccl_config()` →
   `resolve_algorithm_config(_cfg_all)` produces the merged config for
   `defaults.algorithm` (or the algorithm the user specified).
2. **Import the algorithm module**: `importlib.import_module(self._merged["module"])`.
   The module must expose a `kernel` function, a
   `kernel_args(world_size, n_elem, cube_w, cube_h)` helper, and an optional
   `TOPO_NAME_TO_KIND` mapping.
3. **Resolve world_size** (D2).
4. **Collect topology metadata**: from `spec`, read `n_sips`, `sip_topo` (default
   `ring_1d`), `cube_w`/`cube_h`, and `sips.w`/`sips.h`. If the SIP topology is not
   ring_1d, derive `_sip_topo_w/h` from the explicit `w`/`h` or from the square root
   (guaranteeing `w*h == n_sips`). A mismatch raises `ValueError`.
5. **Install SFR + IPCQ**: call
   `kernbench.ccl.sfr_config.configure_sfr_intercube_multisip(engine, spec, self._merged)`.
   This function pushes the IPCQ neighbor table to pe0 of every SIP/cube (the equivalent
   of a real NCCL communicator's one-time setup).

If this order changes (e.g. the algorithm module load fails before SFR setup), a
partially initialized state can result. D3 is therefore treated as a single atomic
sequence: on failure, the backend remains uninstalled.
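The grid-dimension check in the topology step can be illustrated as follows. This is a sketch under several assumptions: the function name `sip_topo_dims`, the topology label `mesh_2d`, and the choice to return `(n_sips, 1)` for `ring_1d` are all hypothetical; the text only states that non-ring topologies use explicit `w`/`h` or a square root and that `w*h` must equal `n_sips`.

```python
import math
from typing import Optional


def sip_topo_dims(
    n_sips: int,
    kind: str = "ring_1d",
    w: Optional[int] = None,
    h: Optional[int] = None,
) -> tuple:
    """Derive (w, h) for the SIP grid and verify it covers exactly n_sips."""
    if kind == "ring_1d":
        return (n_sips, 1)          # assumption: a ring is treated as one row
    if w is None or h is None:
        side = math.isqrt(n_sips)   # integer square root, exact for squares
        w, h = side, side
    if w * h != n_sips:
        raise ValueError(f"SIP topology {w}x{h} does not match n_sips={n_sips}")
    return (w, h)
```

Using `math.isqrt` avoids floating-point rounding, so a non-square SIP count without explicit dimensions is rejected rather than silently truncated.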
|
||||
|
||||
### D4. Greenlet-local rank registration (ADR-0024 D2)

`DistributedContext._rank_by_greenlet: dict[greenlet, int]` maps each spawned worker greenlet to a rank. When the bench launcher (e.g. `torch.multiprocessing.spawn`) starts a worker, it calls `dc._bind_rank(g, rank)` to register it.

`get_rank()` looks up the greenlet returned by `getcurrent()`. An unregistered greenlet falls back to 0, which keeps single-driver and test usage compatible.

During `all_reduce`, the backend obtains the current greenlet's rank through `_dist_ctx._rank_by_greenlet` (D5).
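The bind/lookup behavior can be illustrated without the greenlet library. In this sketch the class name `RankRegistry` is hypothetical, and any hashable object stands in for a worker greenlet; the real code keys the dict by the object returned from `getcurrent()`.

```python
from typing import Dict, Hashable


class RankRegistry:
    """Hypothetical stand-in for the rank-by-greenlet mapping."""

    def __init__(self) -> None:
        self._rank_by_worker: Dict[Hashable, int] = {}

    def bind_rank(self, worker: Hashable, rank: int) -> None:
        # The launcher registers each spawned worker exactly once.
        self._rank_by_worker[worker] = rank

    def get_rank(self, current: Hashable) -> int:
        # Unregistered workers (e.g. the main driver) fall back to rank 0.
        return self._rank_by_worker.get(current, 0)
```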
|
||||
|
||||
### D5. Behavior of `all_reduce(tensor, op="sum")`

Validation:

- `op != "sum"` → `NotImplementedError`. The current kernels implement only the add reduction.
- `tensor._handle is None` → `RuntimeError("not deployed")`.
- Empty `tensor._handle.shards` → `RuntimeError("no shards")`.

Preparation:

- `n_elem = shards[0].nbytes // tensor.itemsize`: the element count of a single shard.
- `kernel_fn = self._algo_module.kernel`: the entry function of the algorithm module imported in D3.
- Effective cube dims: if the first SIP has exactly one cube, use (1,1) (scalar handling); otherwise use the topology's `cube_w`/`cube_h`. This naturally covers the case where TP uses only some of the cubes.
- `kernel_args = self._algo_module.kernel_args(world_size, n_elem, cube_w, cube_h)`: the algorithm decides the argument set passed to its own kernel.

Dispatch:

- Look up the current greenlet's rank with `_rank_by_greenlet.get(g, 0)`.
- Append `extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`.
- `pending = self.ctx.launch(algorithm_name, kernel_fn, tensor, *kernel_args, *extra_args, _defer_wait=True)`: `_defer_wait=True` delegates the collective drain to the main scheduler (ADR-0027 D0.4).

Drain:

- If a parent greenlet is alive (multi-greenlet mode), enqueue onto `_pending_collective_handles` and switch to the parent. The main scheduler drains everything in one pass after every rank has launched.
- In single-driver mode, drain inline immediately: `for h, _sip_id, meta in pending: self.ctx.wait(h, _meta=meta)`.
|
||||
|
||||
### D6. `barrier()` is a no-op (single-driver model)

kernbench handles every rank as a greenlet inside a single Python process. Since nothing ever requires cross-process synchronization, `barrier()` is callable but performs no synchronization at all. It is kept for API compatibility with real PyTorch DDP (so that callers do not hit a NotImplementedError).

If a multi-process kernbench is introduced later (e.g. an independent SimPy event loop per process), a new ADR superseding D6 will be required.
|
||||
|
||||
### D7. Meaning of `get_rank` / `get_world_size` / `get_backend`

- `get_rank()` (D4): the rank bound to the current greenlet; 0 if unregistered.
- `get_world_size()` (D2): the world_size the backend computed in D3.
- `get_backend()`: always the string `"ahbm"`. If no backend object exists, `_ensure_initialized` raises RuntimeError.

Difference from real PyTorch:

- In real PyTorch, `get_rank()` is a process-global value; in kernbench it is greenlet-local. Called inside a spawned worker it returns that worker's rank; called from the main thread it returns 0. Bench authors should expect a meaningful rank only inside the worker function.
|
||||
|
||||
### D8. Supported API surface (final)

API exposed by `DistributedContext`:

- `init_process_group(backend="ahbm", world_size=None, rank=None, **kwargs)`
- `is_initialized() -> bool`
- `get_world_size() -> int`
- `get_rank() -> int`
- `get_backend() -> str`
- `all_reduce(tensor, op="sum") -> None`
- `barrier() -> None`
- (internal) `_bind_rank(g, rank)`

The other PyTorch distributed APIs (broadcast, reduce, all_gather, gather, scatter, send/recv, etc.) are **not implemented yet**. At the kernel level they can be expressed directly with `tl.send`/`tl.recv` (ADR-0046 D3.10), but they are not exposed on the dist.* surface. When another collective is needed, extend D8 by adding a separate algorithm module together with a matching `DistributedContext` method.
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Create the backend eagerly in `RuntimeContext.__init__`

Rejected. If ccl.yaml is missing or the algorithm module cannot be imported, RuntimeContext construction itself would fail even for a bench that never uses distributed features. "Install only when called" (D1) is the correct lazy semantics.

### A2. Always derive world_size from the topology (no override)

Rejected. The "explicit override" path of ADR-0024 D1 is in use by legacy tests. It is kept for diagnostic scenarios that need to define PE-level ranks separately within a single SIP.

### A3. Handle `op != "sum"` with a silent fallback

Rejected. If the user intended `op="prod"` / `"max"` / `"avg"` and sum runs silently instead, the result is very hard to validate. An explicit `NotImplementedError` is safer.

### A4. Implement `barrier` as a SimPy event

Rejected (for now). In the single-driver model cross-process synchronization has no meaning, so a no-op is semantically accurate. A fake SimPy barrier would only add code complexity without adding meaning. Re-evaluate when a multi-process kernbench is introduced.
|
||||
|
||||
## Consequences
|
||||
|
||||
- The installation sequence (D3) of `torch.distributed.init_process_group(backend="ahbm")` is fixed at the ADR level, so it is clear where a future collective algorithm should hook in.
- The precedence in D2 (algorithm > defaults > topology) is stated explicitly, so the impact of a ccl.yaml change can be estimated quickly.
- The barrier no-op decision in D6 is fixed at the ADR level, making it clear that a multi-process kernbench must supersede it with a separate ADR.
- The list of unsupported APIs in D8 is explicit, giving a clear basis for rejection when a user tries to call `dist.broadcast(...)`.
|
||||
@@ -0,0 +1,262 @@
|
||||
# ADR-0048: Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-22).
|
||||
|
||||
This ADR specifies the free-list algorithm, page alignment, and coalescing rules used by `_FreeList` / `PEMemAllocator` in `policy/address/allocator.py` and by `VirtualAllocator` in `va_allocator.py`. ADR-0001 (PhysAddr layout) and ADR-0011 (PA/VA/LA model) define the address scheme, but the **allocation algorithm** had no ADR of its own.
|
||||
|
||||
## First action (what happens first)

### `_FreeList(capacity)`

On construction it initializes `self._capacity = capacity`, `self._used = 0`, and `self._free = [(0, capacity)]`. Its first action is to **set up the whole region as a single free block**: the one tuple `(offset=0, size=capacity)` is the only element of the free list.

### `PEMemAllocator(sip_id, die_id, pe_id, cfg)`

On construction it creates two `_FreeList` instances:

- `self._hbm = _FreeList(cfg.hbm_slice_bytes)`: sized to the HBM slice this PE owns (`hbm_bytes_per_cube // hbm_slices_per_cube` bytes).
- `self._tcm = _FreeList(cfg.tcm_allocatable_bytes)`: sized to `tcm_bytes_per_pe - tcm_scheduler_reserved_bytes` (the scheduler reservation is carved out beforehand).

So the first action of PEMemAllocator is to **set up this PE's HBM slice and its user TCM region as one free block each**.
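The two capacities are plain integer arithmetic over the config. A minimal sketch follows, assuming a hypothetical `AddressConfigSketch` dataclass that carries only the four fields named above; the numbers in the usage note are illustrative and not taken from any real topology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddressConfigSketch:
    """Hypothetical subset of AddressConfig, for illustration only."""

    hbm_bytes_per_cube: int
    hbm_slices_per_cube: int
    tcm_bytes_per_pe: int
    tcm_scheduler_reserved_bytes: int

    @property
    def hbm_slice_bytes(self) -> int:
        # One HBM slice per PE: the cube's HBM split evenly across slices.
        return self.hbm_bytes_per_cube // self.hbm_slices_per_cube

    @property
    def tcm_allocatable_bytes(self) -> int:
        # The scheduler reservation is removed before the free list exists.
        return self.tcm_bytes_per_pe - self.tcm_scheduler_reserved_bytes
```

For example, an assumed 48 GiB cube with 8 slices gives a 6 GiB `_hbm` free list, and an assumed 2 MiB TCM with 256 KiB reserved gives a 1792 KiB `_tcm` free list.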
|
||||
|
||||
### `VirtualAllocator(va_base, va_size, page_size=2*1024*1024)`

On construction it sets `self._va_base = va_base`, `self._va_size = va_size`, `self._page_size = page_size`, `self._used = 0`, and `self._free = [(va_base, va_size)]`. Its first action is to **set up the range from the VA base through its size as a single block and record page_size**.
|
||||
|
||||
## Context
|
||||
|
||||
`runtime_api/context.py::_ensure_allocators` builds the allocator set in these steps:

1. Read `hbm_total_gb_per_cube`, `hbm_slices_per_cube`, `tcm_size_mb`, the SIP range per target_device, and so on from the spec.
2. Pack every parameter into a frozen `AddressConfig`.
3. Create one `PEMemAllocator(sip, cube, pe, cfg)` instance for every combination of target SIP range × cube × PE.
4. Create one `VirtualAllocator(va_base=0x1_0000_0000, va_size=64 GiB, page_size=pe_mmu.page_size)`.

Responsibilities of the allocators:

- **PEMemAllocator**: PA-space allocation of the PE-local HBM slice / TCM (including PhysAddr encoding).
- **VirtualAllocator**: page-aligned allocation of the device-wide VA space. Afterwards `RuntimeContext._create_tensor` pushes the VA → PA mapping to the fabric as a `MmuMapMsg`.

These algorithms:

- are simple **first-fit**.
- keep the free-block list **sorted by start offset**.
- **coalesce with adjacent blocks on both sides** on `free()`.

Because the rationale for these choices is recorded nowhere, a reference point is needed for future questions such as "why not best-fit?", "why not a buddy allocator?", or "why is a partial-overlap free silently allowed?". This ADR provides it.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. `_FreeList`: offset-based first-fit + coalescing

`policy/address/allocator.py::_FreeList`:

- Internal representation: `list[tuple[int, int]]` = `[(start_offset, size), ...]`, a list of free blocks sorted by start offset.
- `alloc(nbytes)`:
  1. Walk the free list from the front (first-fit).
  2. Carve the front portion out of the first block with `size >= nbytes`.
  3. On an exact match remove the whole block; otherwise shrink it to `(start+nbytes, size-nbytes)`.
  4. `_used += nbytes`, and return the carved `start`.
  5. If no block fits, raise `AllocationError("overflow ... largest free block ...")`.
- `free(offset, nbytes)`:
  1. `_used -= nbytes`.
  2. Find the insertion position with `bisect_left(self._free, (offset,))`.
  3. If the preceding block is adjacent (`prev_start + prev_size == offset`), absorb it.
  4. If the following block is adjacent (`offset+nbytes == next_start`), absorb it.
  5. Insert the coalesced range at its sorted position.

This algorithm is weaker against fragmentation than best-fit or buddy, but the design assumption is that it suffices for this simulator's workload, whose deploy/free pattern is almost stack-like. If the workload changes, D1 is a candidate for being superseded.
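The `alloc` / `free` steps above can be written out as follows. This is a sketch reconstructed from the step list, not a copy of `policy/address/allocator.py`; the class name `FreeListSketch` and the exact error text are assumptions.

```python
from bisect import bisect_left
from typing import List, Tuple


class AllocationError(Exception):
    """Raised when no free block is large enough."""


class FreeListSketch:
    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._used = 0
        self._free: List[Tuple[int, int]] = [(0, capacity)]  # (start, size)

    def alloc(self, nbytes: int) -> int:
        for i, (start, size) in enumerate(self._free):  # first-fit scan
            if size >= nbytes:
                if size == nbytes:
                    del self._free[i]  # exact match: drop the block
                else:
                    self._free[i] = (start + nbytes, size - nbytes)
                self._used += nbytes
                return start
        largest = max((size for _, size in self._free), default=0)
        raise AllocationError(
            f"overflow: need {nbytes}, largest free block {largest}"
        )

    def free(self, offset: int, nbytes: int) -> None:
        self._used -= nbytes
        i = bisect_left(self._free, (offset,))
        start, size = offset, nbytes
        if i > 0:  # absorb an adjacent preceding block
            prev_start, prev_size = self._free[i - 1]
            if prev_start + prev_size == start:
                start, size = prev_start, prev_size + size
                i -= 1
                del self._free[i]
        if i < len(self._free):  # absorb an adjacent following block
            next_start, next_size = self._free[i]
            if start + size == next_start:
                size += next_size
                del self._free[i]
        self._free.insert(i, (start, size))
```

Allocating three blocks that fill the region and freeing them in any order returns the list to a single `(0, capacity)` block, which is the coalescing property D1 relies on.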
|
||||
|
||||
### D2. Partial-overlap frees are **not checked**

`_FreeList.free(offset, nbytes)` trusts the caller to pass the exact (offset, nbytes). It does not verify:

- that the range was actually allocated.
- that the range does not overlap another allocated region.

Reason: in the simulator context the caller always stores the return value of `alloc()` and passes it back to `free()` unchanged; it is never external user input. Adding a safety check would cost O(N) on every free and affect simulation wall-clock time.

If this trust model breaks (e.g. a code path is introduced where two tensors point at the same PA), revisit it at the ADR level immediately.
|
||||
|
||||
### D3. `PEMemAllocator`: separate HBM and TCM channels

`PEMemAllocator(sip_id, die_id, pe_id, cfg)` holds two `_FreeList` instances:

- `_hbm`: of size `cfg.hbm_slice_bytes`.
- `_tcm`: of size `cfg.tcm_allocatable_bytes` (= `tcm_bytes_per_pe - tcm_scheduler_reserved_bytes`).

`alloc_hbm(nbytes) -> PhysAddr`:

- Obtain an offset with `_hbm.alloc(nbytes)`.
- Encode the PA with `PhysAddr.pe_hbm_addr(sip_id, die_id, pe_id, pe_local_hbm_offset=offset, slice_size_bytes=cfg.hbm_slice_bytes)`.
- On failure, raise `AllocationError("HBM overflow ...")`.

`free_hbm(pa, nbytes)`:

- Recover the PE-local offset as `pa.hbm_offset - pe_id * cfg.hbm_slice_bytes`.
- `_hbm.free(offset, nbytes)`.

`alloc_tcm(nbytes) -> PhysAddr`: analogous, encoding with `PhysAddr.pe_tcm_addr`.

`free_tcm(pa, nbytes)`: uses `pa.sub_offset` as-is (for TCM the PE-local offset is the sub_offset).

The allocator is unaware of the scheduler-reserved TCM region (`cfg.tcm_scheduler_reserved_bytes`); it is already subtracted from the capacity of `_tcm`. This is consistent with the PE_SCHEDULER internal buffer reservation of ADR-0014.
|
||||
|
||||
### D4. `VirtualAllocator`: page-aligned first-fit + coalescing

`policy/address/va_allocator.py::VirtualAllocator`:

- Internal representation: the same sorted `list[tuple[int, int]]` as `_FreeList`. Initially `[(va_base, va_size)]`.
- `_align_up(nbytes) = ceil(nbytes / page_size) * page_size`.
- `alloc(nbytes) -> int`:
  1. `aligned = _align_up(nbytes)`.
  2. First-fit search for a block with `size >= aligned`.
  3. Carve `aligned` bytes from the front of the block; remove the block on an exact match.
  4. `_used += aligned`. Return the block `start` (the aligned VA).
  5. On failure, raise `VaAllocationError`.
- `free(va, nbytes)`: frees in units of `_align_up(nbytes)`, using the same coalescing algorithm as `_FreeList`.
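The page-aligned `alloc` path can be sketched as below. The class name `VirtualAllocatorSketch` and the error message are assumptions, and `free` is omitted because it follows the same coalescing steps as `_FreeList`. Note that returned addresses stay page-aligned only when `va_base` itself is page-aligned, which this sketch assumes.

```python
from typing import List, Tuple


class VaAllocationError(Exception):
    """Raised when no free VA block can hold the page-aligned request."""


class VirtualAllocatorSketch:
    def __init__(self, va_base: int, va_size: int,
                 page_size: int = 2 * 1024 * 1024) -> None:
        self._page_size = page_size
        self._used = 0
        self._free: List[Tuple[int, int]] = [(va_base, va_size)]

    def _align_up(self, nbytes: int) -> int:
        # Integer form of ceil(nbytes / page_size) * page_size.
        return -(-nbytes // self._page_size) * self._page_size

    def alloc(self, nbytes: int) -> int:
        aligned = self._align_up(nbytes)
        for i, (start, size) in enumerate(self._free):  # first-fit scan
            if size >= aligned:
                if size == aligned:
                    del self._free[i]
                else:
                    self._free[i] = (start + aligned, size - aligned)
                self._used += aligned
                return start
        raise VaAllocationError(
            f"cannot place {aligned} bytes ({nbytes} requested)"
        )
```

With a 4 KiB page, a 1-byte request consumes one page and a 4097-byte request consumes two, so `_used` tracks aligned bytes rather than requested bytes.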
|
||||
|
||||
The actual `page_size` has different defaults in two places:

- Parameter default of `VirtualAllocator.__init__`: `2 MiB`. Tests that call it directly get this value.
- When `RuntimeContext._ensure_allocators` instantiates it: `pe_mmu.attrs.get("page_size", 4096)`, i.e. `pe_mmu.attrs.page_size` from `topology.yaml` if present, otherwise a 4 KiB fallback.

Why the two defaults differ: the standalone default of VirtualAllocator matches the PE_MMU stopgap default (2 MiB) of ADR-0039, which makes direct tests natural, while the 4 KiB context fallback is a safe minimum page when the topology does not set one. The real usage path is always the latter (because `_ensure_allocators` does the instantiation), and when `topology.yaml` specifies `page_size`, that value flows consistently to both sides (MMU + VA allocator).

If this agreement breaks (e.g. VirtualAllocator is instantiated with a page_size different from PE_MMU's), MMU `map()` falls into sub-page region mode (ADR-0039 D3).

Default VA range: `va_base = 0x1_0000_0000` (= 4 GiB), `va_size = 64 GiB`. These values are hard-coded in `_ensure_allocators` and carry no direct meaning in the VA model of ADR-0011; they simply reserve, device-wide, an address space large enough not to collide with host code.
|
||||
|
||||
### D5. Lifecycle of allocator instances

- `RuntimeContext._ensure_allocators` is called lazily (on the first call to `_create_tensor`).
- Once created, the allocator dict (`self._allocators`) is reused for the lifetime of the RuntimeContext. A second deploy in the same process does not create new objects.
- `RuntimeContext.cleanup()` calls `_free_tensor()` on every living tensor → MMU unmap + `va_allocator.free` + `pemem_allocator.free_hbm` restore the free lists to their original state. When the next RuntimeContext builds its allocators, they start from the initial state.

Because allocator state is not shared across RuntimeContexts, the deploy → cleanup → deploy sequence of back-to-back runs within a single process is deterministic.
|
||||
|
||||
### D6. Allocator failures raise (no silent OOM)

Both `_FreeList.alloc` and `VirtualAllocator.alloc` raise `AllocationError` / `VaAllocationError` when no sufficient free block exists. The message includes "required size + largest available block", so fragmentation can be told apart from a genuine OOM.

A silent fallback (e.g. allocating only as much as the largest block) is strictly forbidden: if a partially allocated tensor enters the SimPy stage, routing and DMA see a wrong PA and simulation accuracy breaks.
|
||||
|
||||
### D7. 1:1 correspondence between address spaces and allocators

The separation of physical address spaces is expressed by the PhysAddr sub-unit (ADR-0001 D2.3), and each sub-unit gets its own allocator instance:

- HBM slice → `PEMemAllocator._hbm`.
- PE TCM → `PEMemAllocator._tcm`.
- (currently unused) M_CPU local memory, CUBE SRAM → would need separate allocators. The current implementation still handles them as IPCQ-only slots (ADR-0023 D9.7) and they do not share PA space, so there is no separate free-list.

When a cube-level SRAM allocator becomes necessary, add a `_FreeList(cfg.sram_bytes_per_cube)` instance per cube (`cfg.sram_bytes_per_cube` is already defined in `AddressConfig`, so the data model is ready).
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. best-fit / buddy allocator

Rejected (for now). The workload's alloc/free pattern is stack-like (deploy order = free order), so first-fit + coalescing keeps fragmentation sufficiently under control. If long-running fragmentation is observed in LLM kernel sweeps, write a separate ADR to switch to buddy.

### A2. Add partial-overlap free validation

Rejected, because of the trust model of D2 plus the O(N) check cost. An option that enables it in a debug mode (e.g. a `KERNBENCH_DEBUG` env var) remains possible as follow-up work.

### A3. Unified allocator for VA and PA

Rejected. The VA space (64 GiB device-wide) and the PA space (~6 GiB per slice) differ in kind: VA is the host kernel's view, PA is the device sub-unit's view. The allocators are kept separate to stay consistent with the spirit of the ADR-0011 VA model (the MMU maps between the two).

### A4. Multi-tier page_size support (large page + small page)

Rejected (for now). A single page_size (currently 2 MiB) fits the tensor sizes of LLM kernels (a few MiB to a few GiB), and small mappings, when needed, are absorbed by the sub-page regions of ADR-0039 D3. Multi-tier pages require extending the MMU model itself, so they are a candidate for a separate ADR.
|
||||
|
||||
## Consequences
|
||||
|
||||
- The allocator algorithms are fixed at the ADR level (D1, D3, D4), so when a new simulation scenario hits a fragmentation issue it is clear that first-fit + coalescing is in use here.
- The trust model of D2 is explicit, so the need to supersede this ADR is recognized early if a path that takes alloc/free directly from user input is introduced.
- The 1:1 sub-unit-to-allocator correspondence of D7 is explicit, so it is clear where to add a free-list when separate M_CPU/SRAM regions are needed.
- D4 records that the page_size of `VirtualAllocator` must match the PE_MMU setting, so the interaction with the ADR-0039 stopgap behavior can be assessed quickly when page_size changes in topology.yaml.
|
||||
@@ -0,0 +1,231 @@
|
||||
# ADR-0049: `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-22).
|
||||
|
||||
This ADR specifies the traffic-pattern catalog exposed by `run_probe(...)` in `probes/probe.py`, the formula-vs-actual comparison, and the meaning of invariant checks such as monotonicity and D2H≥H2D. ADR-0010 (CLI surface) enumerates the `kernbench probe` subcommand, but **what probe actually measures** and **which invariants it judges as PASS/FAIL** were not recorded at the ADR level.
|
||||
|
||||
## First action (what happens first)

The first four operations of `run_probe(topology_path, case_filter=None)`:

1. Compute the absolute path with `Path(topology_path).expanduser().resolve()`.
2. `load_topology(path)` → a `TopologyGraph` instance (graph + spec).
3. `_build_edge_map(graph)` → a fast `{(src, dst): Edge}` lookup table.
4. Instantiate `AddressResolver(graph)` + `PathRouter(graph)`.

It then sets `nbytes = 32768` (= 32 KiB, the reference data size of the summary table) and `show_all = (case_filter is None or case_filter == "all")`.

In other words, **probe's first action is to load the topology once, prepare the edge map / resolver / router, and fix 32 KiB as the standard measurement size**. After that, the cases of the three categories H2D → D2H → PE DMA each run in a separate `GraphEngine` instance (preventing cross-talk between cases).
|
||||
|
||||
## Context
|
||||
|
||||
`kernbench probe` is a verification tool introduced for the following purposes:

- **Ground truth for manual analysis**: when latency from a real simulation (`kernbench run --bench ...`) looks abnormal, obtain the answer for a simple traffic pattern separately and compare.
- **Formula vs actual comparison**: check whether the analytical model (wire latency + overhead + drain) agrees with the simulation result (`total_ns`). A mismatch is a clue to which simplifying assumption of the model (ADR-0033) is missing something.
- **Monotonicity check**: automatic confirmation of the invariant that latency must increase monotonically as the hop count grows.
- **Utilization sweep**: a table of BW utilization per data size (4 KiB to 1 MiB).

Without an ADR-level specification of this tool's behavior:

- Someone adding another kind of traffic pattern (e.g. MCpuDma, IPCQ) has difficulty following the table format and measurement units of the existing categories consistently.
- It is ambiguous what monotonicity is checked against (hop count? cube distance? wire length?).
- The meaning of the 32 KiB reference size and the `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]` sweep can only be confirmed by grepping the code.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. Three case categories: H2D / D2H / PE DMA

Each category has a distinct data path in the topology and is printed as its own summary table + sweep table + route detail block.

- **H2D (Host→Device Write)**: `MemoryWriteMsg(dst_sip=0, dst_cube, dst_pe=0, pattern="zero")` flows along `pcie_ep → io_cpu → m_cpu → hbm_ctrl`. The hop count grows with the cube index:
  - h2d-1hop: cube=0, hops=1
  - h2d-2hop: cube=4, hops=2
  - h2d-3hop: cube=8, hops=3
  - h2d-4hop: cube=12, hops=4
- **D2H (Device→Host Read)**: `MemoryReadMsg(src_sip=0, src_cube, src_pe=0)`. The latency is the sum of the forward command path and the reverse data path. Same four hop categories.
- **PE DMA (PE-initiated)**: `PeDmaMsg(src_sip, src_cube, src_pe, dst_pa)`. Five cases that vary the cube/PE position:
  - pe-local-hbm: same cube, same PE
  - pe-same-half-hbm: same cube, different PE (PE 1)
  - pe-cross-half-hbm: same cube, far PE (PE 4)
  - pe-cross-cube-hbm-best: adjacent cube (cube 1)
  - pe-cross-cube-hbm-worst: diagonal far cube (cube 15)

The cube/PE indices 4/8/12 (H2D) and 1/4/15 (PE DMA) are meaningful because of how distance is defined on the 4x4 cube mesh (sip.cube_mesh.w=4, h=4); if the cube_mesh size changes later, these values must be updated with it.
|
||||
|
||||
### D2. Standard measurement size: `nbytes = 32768` (32 KiB)

The summary table of every case shows the result of a single run with `nbytes=32768`. Why 32 KiB was chosen:

- A moderate size at which neither DMA overhead nor BW drain dominates.
- Comparable to the single-transfer unit of many sub-units (TCM, register file).

The change in utilization with size is shown in a separate sweep table (D3).
|
||||
|
||||
### D3. Utilization sweep: `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]`

`SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]`, `SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]`. The following formulas are applied for each size:

```
drain = nbytes / bottleneck_bw
total = overhead + wire + drain
eff_bw = nbytes / total
util% = eff_bw / bottleneck_bw × 100
```

If `bn_bw is None or <= 0`, the column prints 0.0 %. Purpose: to read from a single table the pattern in which, as the hop count grows, small transfers become overhead-bound and large transfers become drain-bound.
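The sweep formulas translate directly into code. In this sketch the function name `utilization_pct` is hypothetical, and `bottleneck_bw` is assumed to be in bytes per nanosecond (decimal GB/s and bytes/ns are numerically equal), so that `drain` comes out in ns like the other terms.

```python
from typing import Optional

SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]
SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]


def utilization_pct(nbytes: int, overhead_ns: float, wire_ns: float,
                    bottleneck_bw: Optional[float]) -> float:
    """Hypothetical helper applying the D3 formulas to one transfer size."""
    if bottleneck_bw is None or bottleneck_bw <= 0:
        return 0.0  # no usable bottleneck BW: the column shows 0.0 %
    drain = nbytes / bottleneck_bw
    total = overhead_ns + wire_ns + drain
    eff_bw = nbytes / total
    return eff_bw / bottleneck_bw * 100.0


def sweep(overhead_ns: float, wire_ns: float, bottleneck_bw: float) -> dict:
    # One utilization value per sweep label, in table order.
    return {
        label: utilization_pct(size, overhead_ns, wire_ns, bottleneck_bw)
        for label, size in zip(SWEEP_LABELS, SWEEP_SIZES)
    }
```

Because `util% = drain / (overhead + wire + drain) × 100`, utilization rises with transfer size for a fixed path, which is the overhead-bound to drain-bound transition the table is meant to show.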
|
||||
|
||||
### D4. Measured items: actual / formula / breakdown

Columns shown on each case row:

- `Actual` (total_ns): `trace["total_ns"]` from the SimPy run.
- `Ovhd`: sum of `node.attrs["overhead_ns"]` over every node on the path (formula breakdown).
- `Drain`: `nbytes / min(edge.bw_gbs over path)` (formula).
- `Wire`: `Σ edge.distance_mm * (ns_per_mm from spec)`.
- `Ovhd%` / `Drain%`: the share of Actual taken by Ovhd / Drain (the formula's Wire is usually very small and is not shown).
- `Eff.BW`: `nbytes / total_ns` (the actually measured BW).
- `BN.BW`: bottleneck bandwidth (formula), the minimum BW over every edge on the path. "-" if no edge has a BW.
- `Util%`: `Eff.BW / BN.BW × 100`. 100% means the single-stream BW upper bound is reached.

A large gap between the formula sum (`wire + ovhd + drain`) and actual signals that something is not captured by the model's simplifications (a check of the ADR-0033 assumptions).
|
||||
|
||||
### D5. Automatic invariant checks: PASS/FAIL

The following invariants are checked automatically and printed as `[v] PASS` / `[x] FAIL`:

- **H2D / D2H monotonic increase**: actual latency must increase monotonically with hop count. `all(lats[i] < lats[i+1] for ...)`.
- **D2H ≥ H2D**: at the same hop index, D2H ≥ H2D (because D2H has two legs, forward command + reverse data). `all(d2h[i].total >= h2d[i].total)`.
- **PE DMA best < worst**: cross-cube best (adjacent) latency < cross-cube worst (diagonal) latency.
- **PE DMA local vs remote**: prints a comparison of local BN BW vs remote BN BW (informational, not PASS/FAIL).

A FAIL is printed clearly on a single line so that a model/topology regression is noticed immediately.
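The first two invariants reduce to short predicates over latency lists. A minimal sketch, assuming plain lists of `total_ns` values ordered by hop count; the function names and the text after the `[v] PASS` / `[x] FAIL` prefix are hypothetical.

```python
from typing import Sequence


def strictly_increasing(lats: Sequence[float]) -> bool:
    # Latency must grow at every step as the hop count grows.
    return all(lats[i] < lats[i + 1] for i in range(len(lats) - 1))


def d2h_at_least_h2d(d2h: Sequence[float], h2d: Sequence[float]) -> bool:
    # Compare the two directions at the same hop index.
    return all(d >= h for d, h in zip(d2h, h2d))


def verdict_line(ok: bool, label: str) -> str:
    # Only the prefix is described as stable; the label text is illustrative.
    return f"{'[v] PASS' if ok else '[x] FAIL'} {label}"
```

With a single data point both predicates are vacuously true, which is consistent with the comparisons being skipped when a case filter leaves one case per category.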
|
||||
|
||||
### D6. Route detail: per-hop timestamp trace

After the summary and sweep tables, the path of each case and its per-hop cumulative time (`_hop_timestamps`) are printed in a separate section:

- H2D: leg1 (`pcie_ep → io_cpu`) + leg2 (`io_cpu → m_cpu`) + leg3 (`m_cpu → hbm_ctrl`) + per-hop trace.
- D2H: forward (cmd, no data) and reverse (data) traces shown separately.
- PE DMA: `pe_dma → router → hbm_ctrl` path + per-hop trace.

Each hop's timestamp is the running sum of `wire_ns + overhead_ns`. The annotation of the terminal hop carries `drain:Xns`. The bottleneck edge is marked `<BN:XXGB/s>` so it can be identified at a glance.
|
||||
|
||||
### D7. Meaning of the case_filter argument

- `None` or `"all"`: run every case (default).
- Any other string: run only the case whose name matches it exactly. Example: `kernbench probe --case h2d-2hop`.

Within each category a case is skipped when `name != case_filter`, and that category's monotonicity / D2H≥H2D comparisons are naturally skipped when only one data point exists.

The CLI parser's default for `--case` is `"all"`, so omitting the argument runs everything.
|
||||
|
||||
### D8. A fresh GraphEngine per case

The 4 H2D, 4 D2H, and 5 PE DMA cases each run in a **new GraphEngine** instance (`engine = GraphEngine(graph)`). Reasons:

- Isolate accumulated state (op_log, completion tracking, allocators, etc.) so it does not cross-talk between cases.
- Guarantee that one case's traffic does not influence another case's BW measurement.

This isolation lets probe results be interpreted as the **standalone single-flow** latency of each case. Measuring multi-flow contention is the responsibility of other tools (e.g. the `pe2pe_overview` plot, the multi-flow merging model of ADR-0033).
|
||||
|
||||
### D9. Stability of the output format

probe's stdout is meant for human reading; the exact column widths, separators, and whitespace are not a machine-readable contract. For automated tools to parse probe results, a separate JSON output mode would have to be added (not implemented yet).

The `[v]` / `[x]` prefixes of the PASS/FAIL lines are guaranteed stable as anchors for CI grep.
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Register probe as a separate bench (`@bench(name="probe")`)

Rejected. probe is intended as a verification tool, not a bench: multi-engine runs for sweeps/analysis and invariant PASS/FAIL output are its essence, and that does not fit the "single device + single RuntimeContext" bench model of ADR-0045.

### A2. Exit code 1 on a monotonicity violation

Rejected (for now). The tool is designed primarily for human inspection, so it prints the PASS/FAIL lines and exits with 0. If CI should fail on a violation, a separate wrapper can decide based on the result of `grep "\[x\]"`. A strict-mode flag (`--strict`) can be introduced later.
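The wrapper idea in A2 can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the wrapper is assumed to receive probe's captured stdout as a string, and the sample lines in the usage note are invented for the example.

```python
def has_violation(probe_stdout: str) -> bool:
    # Mirrors `grep "\[x\]"`: any line containing the FAIL anchor counts.
    return any("[x]" in line for line in probe_stdout.splitlines())


def exit_code_for(probe_stdout: str) -> int:
    # CI convention: non-zero when at least one invariant failed.
    return 1 if has_violation(probe_stdout) else 0
```

For instance, `exit_code_for("[v] PASS h2d monotonic\n[x] FAIL d2h >= h2d\n")` evaluates to 1, while output with only `[v]` lines evaluates to 0. Relying only on the `[x]` anchor keeps the wrapper independent of column widths and other formatting that D9 declares unstable.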
|
||||
|
||||
### A3. Move probe's case definitions to external YAML

Rejected (for now). The 13 cases (4 H2D + 4 D2H + 5 PE DMA) are hard-coded and their meaning is tightly bound to the topology's mesh structure. Moving them to external YAML would require documenting the meaning of the indices (4, 8, 12 / 1, 4, 15) separately, which loses cohesion. If cases start being added frequently, introduce it then through a separate ADR.

### A4. Add multi-flow contention measurement

Rejected (outside probe's scope). The single-flow isolation model stated in D8 is the core intent of probe. Multi-flow contention belongs to a different area of the ADR-0033 latency model and is handled by a separate tool or a separate case category.
|
||||
|
||||
## Consequences
|
||||
|
||||
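- probe's case catalog (D1) and measurement units (D2/D3) are stated at the ADR level, so it is clear which table format to follow when adding a new traffic category.
- The column meanings of formula vs actual (D4) are fixed, so questions like "why is Drain% 5% / 70%?" about a probe result can be quickly tied to a check of the ADR-0033 assumptions.
- The automatic invariant checks (D5) are fixed in the ADR, establishing a safety net in which probe immediately catches monotonicity / D2H≥H2D regressions when the latency model changes.
- The isolation between cases in D8 is explicit, so probe results can safely be interpreted as single-flow measurements. It is clear that a separate tool track is required if multi-flow measurement becomes necessary.
- The strict-mode flag of A2 is recorded as a follow-up candidate, so it can be introduced with minimal extra work when CI integration is requested.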
|
||||
@@ -0,0 +1,308 @@
|
||||
# ADR-0050: CCL Algorithm Module Contract — `ccl/algorithms/*.py`
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-22).
|
||||
|
||||
This ADR specifies the interface, the kernel signature, and the procedure for adding a new algorithm that a module in the `src/kernbench/ccl/algorithms/` directory must satisfy to be used as a collective algorithm by the AHBM CCL backend (ADR-0047). ADR-0047 D3 mentions in a single line that "the algorithm module must expose `kernel`, `kernel_args`, and an optional `TOPO_NAME_TO_KIND`", but the **contract that algorithm module authors must follow** has never been written down at the ADR level. This ADR is the counterpart of ADR-0045, which covers the bench module contract.
|
||||
|
||||
## First action (what happens first)

An algorithm module is imported at two points:

1. **AHBM backend entry**: when user code calls `dist.init_process_group(backend="ahbm")`, `self._algo_module = importlib.import_module(self._merged["module"])` runs inside `AhbmCCLBackend.__init__`. What happens first at module level at that moment:
   - Integer constants such as `SIP_TOPO_RING/TORUS/MESH` are exposed in the module namespace.
   - The `TOPO_NAME_TO_KIND` dict is exposed in the module namespace; the backend looks it up with `topo_map = getattr(self._algo_module, "TOPO_NAME_TO_KIND", None)`.
   - The `kernel_args` function is defined, for the caller to use at call time.
   - Algorithm functions such as `allreduce_intercube_multidevice` are defined.
   - On the module's last line, the alias is exposed via `kernel = allreduce_intercube_multidevice`.

2. **ccl.yaml install stage**: the same algorithm module is imported when `kernbench.ccl.install.install_ipcq` is called to push the IPCQ neighbor table.

In other words, **the first action of an algorithm module is to expose the topology-kind constants, the `TOPO_NAME_TO_KIND` dict, the `kernel_args` function, and the `kernel` alias in the module namespace**. All of this is achieved as an import-time side effect; no separate initialization function call is needed.
|
||||
|
||||
## Context
|
||||
|
||||
At process-group initialization, `AhbmCCLBackend` (ADR-0047) obtains the module path from `defaults.algorithm` in `ccl.yaml` (or the algorithm name the user specified) and imports it dynamically. The backend expects the following four things from that module:

- `kernel`: the entry function of the collective.
- `kernel_args(world_size, n_elem, cube_w=, cube_h=) -> tuple`: the bundle of positional arguments to pass to the kernel.
- `TOPO_NAME_TO_KIND` (optional): a dict mapping the `sips.topology` string of `topology.yaml` (e.g. `"ring_1d"`, `"torus_2d"`, `"mesh_2d_no_wrap"`) to integer constants.
- (indirect) IPCQ neighbor table installation: `configure_sfr_intercube_multisip` decides the SFR from the algorithm module's `TOPO_NAME_TO_KIND` and `cube_w/h`.

The only algorithm module currently in the corpus is `lrab_hierarchical_allreduce.py` (248 lines). The name stands for "**l**eft-**r**ight **a**lternating **b**roadcast **hierarchical allreduce**". Whenever modules such as `ring_allreduce`, `tree_allreduce`, or `broadcast` are added in the future, they must follow this contract for dispatch to stay consistent.

Without this contract at the ADR level:

- A new algorithm author has to infer the signature from the single line in ADR-0047 D3.
- The meaning of the kernel function's argument order (in particular `t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl`) is not clear without grepping the code.
- Which arguments `kernel_args` takes and which tuple it must return is fixed only by convention.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. An algorithm module exposes four public symbols

```python
# src/kernbench/ccl/algorithms/<name>.py

from __future__ import annotations

# (required) topology-kind constants, used inside the algorithm
SIP_TOPO_RING = 0
SIP_TOPO_TORUS = 1
SIP_TOPO_MESH = 2

# (optional) topology name → kind mapping. The backend uses it to convert the
# string SIP topology from ccl.yaml/topology into an integer.
TOPO_NAME_TO_KIND = {
    "ring_1d": SIP_TOPO_RING,
    "torus_2d": SIP_TOPO_TORUS,
    "mesh_2d_no_wrap": SIP_TOPO_MESH,
}

# (required) kernel argument builder
def kernel_args(world_size: int, n_elem: int, *, cube_w: int = 4, cube_h: int = 4) -> tuple:
    return (n_elem, cube_w, cube_h, world_size)

# (required) kernel function (TLContext is injected through the `tl=...` keyword)
def my_allreduce_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
                        sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
    ...

# (required) kernel alias; the backend accesses it as `module.kernel`
kernel = my_allreduce_kernel
```

- The `kernel` alias is the entry point the backend calls directly. Whatever the function is named (e.g. `allreduce_intercube_multidevice`), it must be exposed as `module.kernel = fn`.
- Without `kernel_args` the backend has no way to build the algorithm's arguments. See D2 for the signature.
- Without `TOPO_NAME_TO_KIND` the backend falls back to `sip_topo_kind = 0`. An algorithm that supports only a single topology may omit it.
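A module author could check the contract above with a small validator before wiring a new algorithm into `ccl.yaml`. This is a sketch under assumptions: `validate_algorithm_module` and `lookup_topo_kind` are hypothetical helpers, not backend functions, and the behavior for a topology name missing from the map (a `KeyError` here) is a choice made for the example.

```python
from types import SimpleNamespace
from typing import Any


def validate_algorithm_module(module: Any) -> None:
    """Hypothetical check that the two required callables are exposed."""
    for name in ("kernel", "kernel_args"):
        if not callable(getattr(module, name, None)):
            raise AttributeError(
                f"algorithm module must expose a callable `{name}`"
            )


def lookup_topo_kind(module: Any, topo_name: str) -> int:
    topo_map = getattr(module, "TOPO_NAME_TO_KIND", None)
    if topo_map is None:
        return 0  # documented fallback when the mapping is omitted
    return topo_map[topo_name]  # assumption: unknown names raise KeyError


# Stand-in "modules" built from namespaces, for demonstration only.
full_module = SimpleNamespace(
    kernel=lambda *args, tl=None: None,
    kernel_args=lambda world_size, n_elem, **kw: (n_elem, 4, 4, world_size),
    TOPO_NAME_TO_KIND={"ring_1d": 0, "torus_2d": 1, "mesh_2d_no_wrap": 2},
)
minimal_module = SimpleNamespace(
    kernel=lambda *args, tl=None: None,
    kernel_args=lambda world_size, n_elem, **kw: (n_elem, 4, 4, world_size),
)
```

Here `lookup_topo_kind(full_module, "torus_2d")` yields 1, and `lookup_topo_kind(minimal_module, "torus_2d")` yields the fallback 0.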
|
||||
|
||||
### D2. `kernel_args` signature: `(world_size, n_elem, *, cube_w, cube_h)`

```python
def kernel_args(world_size: int, n_elem: int, *,
                cube_w: int = 4, cube_h: int = 4) -> tuple:
    return (n_elem, cube_w, cube_h, world_size)
```

- **Positional arguments**: `world_size` (= number of ranks), `n_elem` (= element count of a single shard, in f16 terms).
- **Keyword arguments**: `cube_w`, `cube_h` (= cube mesh size). The default is 4×4, matching the default of `sip.cube_mesh` in `topology.yaml`.
- **Return value**: a tuple packed in the order of the kernel's positional arguments.

When the backend's `all_reduce` makes the call:

```python
kernel_args_tuple = self._algo_module.kernel_args(
    self._world_size, n_elem, cube_w=eff_cube_w, cube_h=eff_cube_h,
)
extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)
pending = self.ctx.launch(
    self._merged["algorithm"], kernel_fn, tensor,
    *kernel_args_tuple, *extra_args, _defer_wait=True,
)
```

So the kernel's final positional arguments are `(tensor_ptr, *kernel_args_tuple, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`, and `tl=...` is injected automatically as a keyword on top of that. The length and order of the tuple returned by `kernel_args` must **match the kernel signature 1:1**.
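The 1:1 requirement can be checked mechanically with `inspect.signature(...).bind`, which raises `TypeError` when the assembled arguments do not fit. In this sketch `demo_kernel` and `bind_kernel_call` are hypothetical stand-ins, and the tensor pointer is a plain integer rather than a deployed tensor.

```python
import inspect


def kernel_args(world_size: int, n_elem: int, *,
                cube_w: int = 4, cube_h: int = 4) -> tuple:
    return (n_elem, cube_w, cube_h, world_size)


def demo_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
                sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
    """Stand-in kernel; only its signature matters here."""


def bind_kernel_call(kernel_fn, tensor_ptr, world_size, n_elem,
                     extra_args, tl):
    # Same assembly order as the backend call: ptr, kernel_args, extras.
    positional = (tensor_ptr, *kernel_args(world_size, n_elem), *extra_args)
    # bind() raises TypeError if the arguments do not fit the signature.
    return inspect.signature(kernel_fn).bind(*positional, tl=tl).arguments
```

Binding with `extra_args=(2, 0, 0, 0)` and `world_size=4` maps `n_sips` to 4 and `sip_rank` to 2, while a three-element `extra_args` fails fast with `TypeError`.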
|
||||
|
||||
### D3. `kernel` function signature: the standardized 9 + tl arguments

Recommended signature:

```python
def my_kernel(
    t_ptr: int,          # VA base of the row-wise-sharded tensor on this SIP
    n_elem: int,         # element count per cube tile (or per shard)
    cube_w: int,         # cube mesh width (from kernel_args)
    cube_h: int,         # cube mesh height (from kernel_args)
    n_sips: int,         # same as world_size (rank = SIP, ADR-0024)
    sip_rank: int,       # rank of this SIP
    sip_topo_kind: int,  # result of the TOPO_NAME_TO_KIND lookup
    sip_topo_w: int,     # SIP mesh width (0 for ring_1d)
    sip_topo_h: int,     # SIP mesh height (0 for ring_1d)
    *, tl,               # TLContext (auto-injected)
) -> None:
    ...
```

Even if `kernel_args` adopts a different positional order, the **last four positional arguments of the kernel are always `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`**, which the backend appends as `extra_args` (ADR-0047 D5). User-defined algorithms must accept these four arguments too, but a single-SIP algorithm can simply ignore them.

`tl` is injected as a keyword, not as a positional argument: `RuntimeContext.launch` adds `tl=tl_ctx` right before calling the kernel. The `tl` in the kernel signature must therefore be keyword-only (`*, tl`) or take the form of a final keyword parameter.
|
||||
|
||||
### D4. Freedom and constraints of the kernel body

Surface available inside the kernel body: every `tl.*` primitive of ADR-0046 D3.

Patterns that are used especially often:

- `cube_id = tl.program_id(axis=1)`: the index of the cube this PE belongs to.
- `pe_addr = t_ptr + cube_id * nbytes`: VA computation for the per-cube tile.
- `acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")`: load the local data.
- `tl.send(dir=...)` / `tl.recv(dir=..., shape=, dtype=)`: IPCQ collective.
- `acc = acc + recv`: TensorHandle arithmetic operator (ADR-0046 D4).
- `tl.store(pe_addr, acc)`: store the result.

The kernel body is an ordinary Python function, free to branch and loop. However:

- No SimPy `yield` or `async` (ADR-0046 D1).
- No direct access to a TensorHandle's `.data`: the phase 1 timing model knows nothing about data dependence (the 2-pass split of ADR-0020).
- Kernel execution must be deterministic: running twice with the same input issues the same op sequence. No randomness or external IO.
|
||||
|
||||
### D5. SIP topology semantics: the meaning of `sip_topo_kind`

The backend looks up the `system.sips.topology` string from `topology.yaml` in the algorithm module's `TOPO_NAME_TO_KIND` and converts it to the integer `sip_topo_kind`. The algorithm branches on this integer:

```python
if sip_topo_kind == SIP_TOPO_RING:
    acc = _inter_sip_ring(...)
elif sip_topo_kind == SIP_TOPO_TORUS:
    acc = _inter_sip_torus_2d(...)
elif sip_topo_kind == SIP_TOPO_MESH:
    acc = _inter_sip_mesh_2d(...)
```

Each topology branch communicates with its peers through IPCQ direction names (e.g. `"global_E"`, `"W"`, `"S"`, `"N"`). The meaning of each direction is defined by ADR-0023/0025, and `configure_sfr_intercube_multisip` installs the IPCQ neighbor table accordingly.

When an algorithm module receives a topology kind it does not support, it should raise explicitly with `raise ValueError(f"unsupported topology kind {sip_topo_kind}")` rather than fall through as a silent no-op, so that an accidental mis-dispatch by the backend fails fast.
|
||||
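As a rough sketch of the fail-loud recommendation, consider an algorithm that supports only ring and torus. The constant values and the `"torus_2d"` / `"mesh_2d"` names are assumptions for illustration (only `ring_1d` appears in this ADR), and `inter_sip_phase` is a hypothetical stand-in that returns a label instead of running a collective.

```python
# Assumed constants; the real values live in the algorithm module.
SIP_TOPO_RING, SIP_TOPO_TORUS, SIP_TOPO_MESH = 0, 1, 2
TOPO_NAME_TO_KIND = {
    "ring_1d": SIP_TOPO_RING,
    "torus_2d": SIP_TOPO_TORUS,   # hypothetical name
    "mesh_2d": SIP_TOPO_MESH,     # hypothetical name
}

def inter_sip_phase(sip_topo_kind: int) -> str:
    """Dispatch on the topology kind; unsupported kinds raise instead of no-op."""
    if sip_topo_kind == SIP_TOPO_RING:
        return "ring"
    if sip_topo_kind == SIP_TOPO_TORUS:
        return "torus_2d"
    # No silent fall-through: an unsupported kind is a configuration error.
    raise ValueError(f"unsupported topology kind {sip_topo_kind}")
```

With this shape, a topology the algorithm cannot handle surfaces as a `ValueError` on the first launch rather than as a wrong reduction result.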
|
||||
### D6. Structure of an algorithm entry in ccl.yaml

An algorithm module is paired with an entry in `ccl.yaml` (ADR-0023 D10 + ADR-0047 D3):

```yaml
defaults:
  algorithm: lrab_hierarchical_allreduce
  n_elem: 8

algorithms:
  lrab_hierarchical_allreduce:
    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
    # optional: world_size override
    # optional: per-algorithm parameters consumed by configure_sfr_intercube_multisip
```

- `module`: full Python module path. The backend passes this string unchanged to `importlib.import_module`.
- `world_size` (optional): when set, overrides the topology fallback (ADR-0047 D2).
- Algorithm-specific parameters are consumed by `configure_sfr_intercube_multisip`.

To add a new algorithm:

1. Write `src/kernbench/ccl/algorithms/<name>.py` (D1 conventions).
2. Add an entry to the `algorithms` section of `ccl.yaml`.
3. (If needed) add an SFR installation branch to `kernbench.ccl.sfr_config`.
4. Add tests (e.g. `tests/sccl/test_<name>.py`, extending the ADR-0043 eval harness).
|
||||
|
||||
### D7. Legacy "rank = flat PE index" mode

The `world_size` override path in `ccl.yaml` specified by ADR-0047 D2 is used by the legacy "rank = flat PE index" tests. In this mode too, the algorithm module can simply assume that `n_sips=world_size` ranks arrive: the backend separates the rank↔(SIP, cube, PE) mapping beforehand, so the algorithm body needs no mode-specific branching.

However, single-cube workloads pass `cube_w=cube_h=1`, and the algorithm must be written so that the mesh-based phases are skipped in that case (see the `single_cube = (cube_w == 1 and cube_h == 1)` pattern in `lrab_hierarchical_allreduce.py`).
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Structure algorithm modules as classes (`class Allreduce: kernel(...)`, etc.)

Rejected. The Python module namespace itself already serves as the algorithm's identity (the `importlib.import_module` of ADR-0047 D3); a class layer would only add indirection and thicken the dispatch-side code. Module-level free functions plus a `kernel` alias are clear enough.

### A2. Type `kernel_args` as an explicit dataclass

Rejected (for now). It is normal for the argument count to differ per algorithm, and forcing a single dataclass type would make it hard for diverse algorithms to coexist. Returning a tuple is simple and meshes cleanly with the backend's `*kernel_args_tuple` unpacking. If a particular algorithm needs stronger typing, it is free to use a NamedTuple inside its own module.

### A3. Move SFR installation into the algorithm module

Rejected. SFR installation (`configure_sfr_intercube_multisip`) is a cross-module decision that looks at both the topology and the algorithm to install the IPCQ neighbor table, so a dedicated location such as `kernbench.ccl.sfr_config` is a more natural home than the algorithm module. The D6 workflow of "add an sfr_config branch if needed" keeps responsibilities more clearly separated.

### A4. Auto-register algorithm names in a module namespace (decorator)

Rejected. Unlike ADR-0045 (bench), an algorithm is tied directly to its ccl.yaml entry, so an additional registry would be redundant. A single `module` string mapping suffices.
|
||||
|
||||
## Consequences
|
||||
|
||||
- The one-line contract of ADR-0047 D3 is expanded into the author-friendly guide of D1–D7, so adding a new algorithm no longer requires inferring the signature with grep.
- The 9 + tl argument signature of D3 is standardized and fits the backend's `extra_args` append (ADR-0047 D5). It is now explicit that future single-SIP-only algorithms must also accept the four sip_* arguments.
- With the fail-loud recommendation of D5, if the topology in ccl.yaml is misconfigured to one the algorithm does not support, the backend fails with a ValueError instead of producing a silent wrong result.
- The step-by-step procedure of D6 makes clear how far a new algorithm reaches into sfr_config / tests / ccl.yaml.
|
||||
@@ -0,0 +1,267 @@
|
||||
# ADR-0051: Routing Helper API — `AddressResolver` + `PathRouter`
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-22).
|
||||
|
||||
This ADR specifies every public API, argument, and return value of the two helper classes exposed by `policy/routing/router.py` (`AddressResolver`, `PathRouter`), together with where each of the four distinct adjacency graphs is used. ADR-0002 defines routing distance, ordering, and bypass rules, but **the helper API surface itself** has never been documented at the ADR level.
|
||||
|
||||
## First action

### `AddressResolver(graph)`

On construction it caches two things:

1. `self._node_ids = set(graph.nodes)`: the set of all node ids (for lookup).
2. `self._hbm_slice_bytes = hbm_total_gb * (1 << 30) // slices_per_cube`: derived from `graph.spec.cube.memory_map` (default `48 GB / 8 slices = 6 GB`). `resolve()` uses this value to recover `pe_id` from the `hbm_offset` of an HBM PA.

In other words, **the first thing AddressResolver does is precompute the full set of node ids and the HBM slice size**. It does not retain the graph itself.
|
||||
|
||||
### `PathRouter(graph)`

On construction it **builds four separate adjacency graphs at once**:

1. `self._adj_all`: every edge (for component-to-component routing).
2. `self._adj`: only edges with `kind != "command"` (PE DMA / general data path).
3. `self._adj_mcpu_dma`: excludes `_MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"}` (so that M_CPU DMA is not mistakenly routed through PE pipeline nodes).
4. `self._adj_local`: excludes the eight `_UCIE_KINDS` (for cube-local routing; prevents Dijkstra from preferring UCIe over the mesh because UCIe looks like a zero-distance bus).

Each graph is a `defaultdict(list)` of `(neighbor, weight)` pairs, with `edge.routing_weight_mm or edge.distance_mm` as the weight.

In other words, **the first thing PathRouter does is classify the topology edges under four different policies and build four adjacency lists**. Each `find_*()` call then picks the appropriate graph and runs Dijkstra on it.
|
||||
|
||||
## Context
|
||||
|
||||
`policy/routing/router.py` carries two responsibilities:

- **Name mapping**: it is the single owner of the topology naming convention (`sip{S}.cube{C}.<comp>`, `sip{S}.io{I}.pcie_ep`, etc.). Components, probes, IPCQ install, and the runtime API call the helpers instead of building name strings themselves.
- **Path selection**: policies are separated by edge `kind`. Even for the same src→dst pair, a different adjacency must be used depending on the routing intent (PE DMA vs M_CPU DMA vs general component routing), which changes the result.

Although this helper API is consumed widely across the code (probe.py / distributed.py / install.py / various components / tests), the **exact signatures, return semantics, and which adjacency each method uses** are not collected in one place at the ADR level. This ADR fills that gap.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. The 5 public APIs of `AddressResolver`

#### D1.1. `resolve(addr: PhysAddr) -> str`

Converts a `PhysAddr` instance into the destination node id in the topology.

```
addr.kind == "hbm"  →  f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
    where pe_id = addr.hbm_offset // self._hbm_slice_bytes  (ADR-0017 D4/D9)

addr.kind == "pe_resource":
    addr.unit_type == PE    →  f"sip{s}.cube{d}.pe{addr.pe_id}.pe_tcm"
    addr.unit_type == SRAM  →  f"sip{s}.cube{d}.sram"
    addr.unit_type == MCPU  →  f"sip{s}.cube{d}.m_cpu"
    otherwise               →  RoutingError("unsupported unit_type")

any other kind  →  RoutingError("unsupported address kind")
```

If the computed node id is not in `self._node_ids`, `RoutingError(f"node {node_id} not found in topology")` is raised. That is, even when the address is syntactically valid, resolution fails loudly if no node in the topology actually maps to it.
|
||||
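The HBM branch of `resolve()` can be illustrated with a small sketch. It assumes the default 48 GB / 8 slices from this ADR; `resolve_hbm` and the local `RoutingError` are hypothetical stand-ins, not the router's actual code.

```python
GIB = 1 << 30

class RoutingError(Exception):
    """Stand-in for the router's error type."""

def hbm_slice_bytes(hbm_total_gb: int = 48, slices_per_cube: int = 8) -> int:
    """Size of one per-PE HBM slice in bytes (6 GiB with the defaults)."""
    return hbm_total_gb * GIB // slices_per_cube

def resolve_hbm(sip: int, cube: int, hbm_offset: int,
                slice_bytes: int, node_ids: set) -> str:
    """Recover pe_id from the offset, then check the node exists in the topology."""
    pe_id = hbm_offset // slice_bytes
    node_id = f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe_id}"
    if node_id not in node_ids:
        raise RoutingError(f"node {node_id} not found in topology")
    return node_id

slice_bytes = hbm_slice_bytes()
# Toy topology: one cube with eight HBM controller slices.
node_ids = {f"sip0.cube0.hbm_ctrl.pe{i}" for i in range(8)}
```

An offset of 7 GiB lands in the second slice (`pe1`), while any address in a cube that the toy topology does not contain raises `RoutingError`.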
|
||||
#### D1.2. `find_m_cpu(sip, cube) -> str`

Returns `f"sip{sip}.cube{cube}.m_cpu"`. Raises `RoutingError` if the node does not exist.

#### D1.3. `find_pcie_ep(sip, io_id="io0") -> str`

Returns `f"sip{sip}.{io_id}.pcie_ep"`. Raises `RoutingError` if the node does not exist.

#### D1.4. `find_io_cpu(sip, io_id="io0") -> str`

Returns `f"sip{sip}.{io_id}.io_cpu"`. Raises `RoutingError` if the node does not exist.

#### D1.5. `find_all_pcie_eps() -> list[str]`

Returns the PCIE_EP node ids of all SIPs as a sorted list, filtered with `endswith(".pcie_ep")`. Used when cross-SIP IPCQ enumerates every PCIE_EP.

This class is the single owner of the naming convention (`sip{S}.cube{C}.<comp>`, `sip{S}.{io_id}.<comp>`) (ADR-0015 D4). The topology builder creates nodes with the same convention, and components never construct name strings directly; everything goes through the helpers.
|
||||
|
||||
### D2. The 4 adjacency graphs of `PathRouter`

Built once by the constructor. The edge `kind` determines the policy:

| graph | excluded edge kinds | purpose |
|-------|---------------------|---------|
| `_adj_all` | (none) | IO_CPU/M_CPU routing, including M_CPU↔NOC command edges |
| `_adj` | `"command"` | PE DMA / general data path |
| `_adj_mcpu_dma` | `"pe_internal"`, `"pe_to_router"` | M_CPU DMA (bypasses the PE pipeline) |
| `_adj_local` | `_UCIE_KINDS` (`ucie_internal`, `ucie_conn_to_router`, `router_to_ucie_conn`, `ucie_conn_to_noc`, `noc_to_ucie_conn`, `ucie_mesh`, `io_to_cube`, `cube_to_io`) | same-cube routing (bypasses the UCIe bus) |

Each graph is a `dict[node_id, list[(neighbor, weight)]]`, and the weight is `edge.routing_weight_mm or edge.distance_mm`. The split makes the routing influence of command edges explicit, and separating out `_adj_local` keeps UCIe's "0-distance bus" from being preferred over the mesh, consistent with the cross-PE-slice mesh-distance requirement of ADR-0017 D7.
|
||||
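A minimal sketch of the four-way classification, assuming directed edges and a hypothetical `Edge` dataclass (the real edge type and constructor are not shown in this ADR):

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class Edge:  # hypothetical stand-in for the topology edge type
    src: str
    dst: str
    kind: str
    distance_mm: float
    routing_weight_mm: Optional[float] = None

MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"}
UCIE_KINDS = {
    "ucie_internal", "ucie_conn_to_router", "router_to_ucie_conn",
    "ucie_conn_to_noc", "noc_to_ucie_conn", "ucie_mesh",
    "io_to_cube", "cube_to_io",
}

def build_adjacency(edges):
    """Classify every edge into the four adjacency lists in a single pass."""
    adj_all, adj, adj_mcpu_dma, adj_local = (defaultdict(list) for _ in range(4))
    for e in edges:
        weight = e.routing_weight_mm or e.distance_mm
        entry = (e.dst, weight)
        adj_all[e.src].append(entry)            # no exclusions
        if e.kind != "command":
            adj[e.src].append(entry)            # data path only
        if e.kind not in MCPU_DMA_EXCLUDE:
            adj_mcpu_dma[e.src].append(entry)   # keeps M_CPU DMA out of the PE pipeline
        if e.kind not in UCIE_KINDS:
            adj_local[e.src].append(entry)      # cube-local: no UCIe shortcut
    return adj_all, adj, adj_mcpu_dma, adj_local

edges = [
    Edge("a", "b", "command", 1.0),
    Edge("a", "c", "ucie_mesh", 0.0, routing_weight_mm=5.0),
    Edge("a", "d", "pe_internal", 2.0),
]
adj_all, adj, adj_mcpu_dma, adj_local = build_adjacency(edges)
```

Note that with the `or` expression a routing weight of `None` or `0` falls back to `distance_mm`, which is a property of Python's `or` rather than something this ADR states about the real edges.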
|
||||
### D3. The 5 public APIs of `PathRouter` (+ 2 backward-compat)

#### D3.1. `find_path(src_pe: str, dst_node: str) -> list[str]`

**PE DMA routing**. `src_pe` is a PE prefix (e.g. `"sip0.cube0.pe0"`); the function automatically appends `.pe_dma`, so the actual start node is `"sip0.cube0.pe0.pe_dma"`.

The adjacency is chosen by whether the route is cube-local (`_same_cube`):

- **same-cube** (src and dst share the `sip{S}.cube{C}.` prefix): uses `_adj_local`. Blocking the UCIe shortcut makes cross-PE-slice traffic pay the exact mesh distance (ADR-0017 D7).
- **cross-cube**: uses `_adj`. UCIe is naturally included as the best option on a cross-cube path.
|
||||
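The same-cube test can be sketched as below. This is an assumption about how `_same_cube` might work, written to compare whole path components so that `cube1` and `cube10` are not confused; `cube_prefix` and `pick_adjacency` are hypothetical names.

```python
def cube_prefix(node_id: str):
    """Return ('sipS', 'cubeC') for cube-scoped nodes, else None (e.g. IO chiplet nodes)."""
    parts = node_id.split(".")
    if len(parts) >= 2 and parts[0].startswith("sip") and parts[1].startswith("cube"):
        return (parts[0], parts[1])
    return None

def same_cube(src: str, dst: str) -> bool:
    """True only when both nodes sit under the same sip{S}.cube{C}. prefix."""
    a, b = cube_prefix(src), cube_prefix(dst)
    return a is not None and a == b

def pick_adjacency(src_pe: str, dst_node: str) -> str:
    """Name of the adjacency that find_path would use under the D3.1 policy."""
    start = f"{src_pe}.pe_dma"  # the PE prefix gets the .pe_dma suffix
    return "_adj_local" if same_cube(start, dst_node) else "_adj"
```

For example, a route from `sip0.cube0.pe0` to `sip0.cube0.sram` selects `_adj_local`, while the same source routed to `sip1.cube0.sram` selects `_adj`.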
|
||||
#### D3.2. `find_path_with_distance(src_pe, dst_node) -> tuple[list[str], float]`

Uses the same adjacency policy as D3.1 but returns `(path, total_distance)` together. Used by probes and analysis tools that need the distance metric.

#### D3.3. `find_mcpu_dma_path(m_cpu_id: str, dst_hbm_id: str) -> list[str]`

**M_CPU DMA path**. Uses `_adj_local` when the cubes are the same (the route finishes inside the mesh) and `_adj_all` otherwise (via UCIe). Because `_MCPU_DMA_EXCLUDE` automatically rules out PE pipeline nodes, no bogus route in which M_CPU traffic passes through a PE's internal stages is produced.

#### D3.4. `find_memory_path(src: str, dst: str) -> list[str]`

A direct memory path such as `pcie_ep → io_noc → cube → router mesh → hbm_ctrl`. Uses `_adj_mcpu_dma` to exclude `pe_internal` and `pe_to_router` edges, guaranteeing that host-issued reads/writes do not leak into the PE pipeline. Called directly by the probe (the H2D/D2H cases of ADR-0049 D1).

#### D3.5. `find_node_path(src: str, dst: str) -> list[str]`

A path between any two nodes, **including command edges** (uses `_adj_all`). Called by IoCpuComponent, MCpuComponent, and others that must traverse command-kind links such as M_CPU ↔ NOC.

#### D3.6. backward-compat shims

- `_dijkstra(start, goal) -> list[str]`: thin wrapper around `_run_dijkstra(self._adj, …)`.
- `_dijkstra_with_dist(start, goal) -> tuple[list[str], float]`: the variant that also returns the distance.

Despite the underscore prefix suggesting internal API, existing tests call these directly. New code should use D3.1–D3.5; the two shims are deprecation candidates.
|
||||
|
||||
### D4. Dijkstra algorithm: single-source shortest path

`_run_dijkstra_with_dist(adj, start, goal)`:

- `heapq` priority queue.
- `best: dict[node, distance]`: cache of the shortest known distance per node.
- `prev: dict[node, predecessor]`: path reconstruction.
- The weight is `routing_weight_mm or distance_mm`. Some edges, such as UCIe, specify a routing_weight that differs from their distance, so the separate weight is intentional.

`start == goal` takes a fast path and returns `([start], 0.0)`. An unreachable goal raises `RoutingError(f"no path from {start} to {goal}")`.

The algorithm is **deterministic**: the same graph and start/goal yield the same path. This matches the SPEC R1 requirement that "Routing MUST be deterministic". Ties are broken by `heapq` push order (Python list order is deterministic).
|
||||
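A self-contained sketch of such a search is shown below. It is not the code in `router.py`: the insertion counter used to make tie-breaking follow push order is an assumption about one way to get that behavior, and `RoutingError` is a local stand-in.

```python
import heapq
from itertools import count

class RoutingError(Exception):
    """Stand-in for the router's error type."""

def run_dijkstra_with_dist(adj, start, goal):
    """Shortest path over adj[node] -> [(neighbor, weight), ...]; returns (path, distance)."""
    if start == goal:
        return [start], 0.0            # fast path
    best = {start: 0.0}                # shortest known distance per node
    prev = {}                          # predecessor map for path reconstruction
    seq = count()                      # push counter: equal distances pop in push order
    heap = [(0.0, next(seq), start)]
    while heap:
        dist, _, node = heapq.heappop(heap)
        if node == goal:
            break
        if dist > best.get(node, float("inf")):
            continue                   # stale heap entry
        for neighbor, weight in adj.get(node, ()):
            cand = dist + weight
            if cand < best.get(neighbor, float("inf")):  # strict: first-found path wins ties
                best[neighbor] = cand
                prev[neighbor] = node
                heapq.heappush(heap, (cand, next(seq), neighbor))
    if goal not in best:
        raise RoutingError(f"no path from {start} to {goal}")
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    path.reverse()
    return path, best[goal]

line = {"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 1.0)], "c": []}
diamond = {"s": [("x", 1.0), ("y", 1.0)], "x": [("t", 1.0)], "y": [("t", 1.0)]}
```

In the `diamond` graph both routes to `t` cost the same; because `x` is pushed before `y` and the relaxation test is strict, the route through `x` is kept on every run.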
|
||||
### D5. Single-owner principle for the helper API

The following are decided only inside router.py:

- Naming convention: `sip{S}.cube{C}.<comp>`, `sip{S}.{io_id}.<comp>`, `sip{S}.cube{C}.hbm_ctrl.pe{pe_id}`.
- Adjacency policy: which edge kinds belong to which graph.
- How the PE id is recovered from the HBM slice size.
- Dijkstra's weight choice (`routing_weight_mm or distance_mm`).

If this single-owner principle is broken (e.g. components start building `f"sip{s}..."` themselves), the blast radius of any naming-convention change explodes. This is aligned with the spirit of ADR-0015 D4.
|
||||
|
||||
### D6. List of helper API consumers

Call sites of the methods exposed by these helpers (as of the current codebase):

- `probes/probe.py` (ADR-0049): `find_pcie_ep`, `find_io_cpu`, `find_m_cpu`, `find_node_path`, `find_mcpu_dma_path`, `find_memory_path`, `find_path`, `resolve`.
- `runtime_api/distributed.py` (ADR-0047): indirect (routing inside the engine).
- `ccl/install.py` (ADR-0023): `find_all_pcie_eps`, `resolve`.
- `sim_engine/event_log.py`: like the probe, `find_pcie_ep` and `find_memory_path`.
- `components/builtin/m_cpu.py`, `components/builtin/io_cpu.py`: `find_node_path`, `find_mcpu_dma_path`.
- Various tests (test_routing.py, test_cross_sip_routing.py, etc.): most of D3.1–D3.5.

When a new consumer is added, D1/D3 of this ADR is the first reference for judging whether a method matching its intent already exists or a new one must be added.
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. A single adjacency graph with the edge-kind filter applied dynamically

Rejected. Re-filtering the graph on every `find_*()` call hurts Dijkstra's cache locality and performance. Building the four graphs up front (D2) costs little memory (edges number in the tens of thousands at most), and the policy choice at call time is O(1).

### A2. Separate adjacencies by dedicated metadata instead of the edge `kind`

Rejected. The edge `kind` is already assigned by the topology builder (ADR-0015 D4 + ADR-0017); introducing separate metadata would create a redundancy in which two systems have to be kept in sync.

### A3. BFS with uniform weights instead of Dijkstra

Rejected. In practice routing_weight_mm differs per edge (mesh link / UCIe / IO-internal), so BFS only minimizes hop count and does not guarantee the shortest latency or distance. That conflicts with the deterministic and accurate routing required by SPEC R1 + R2.

### A4. Expose the helper API as module functions instead of class methods

Rejected. The two classes (`AddressResolver`, `PathRouter`) each need to hold cached state (`_node_ids`, `_hbm_slice_bytes`, the 4 adjacency graphs), and many routing queries are issued against the same graph instance. Module functions would have to rebuild that state on every call or keep it global, degrading safety and performance.
|
||||
|
||||
## Consequences
|
||||
|
||||
- If components, probes, IPCQ install, and the runtime API all call only the router.py helpers, a naming-convention change (e.g. `.io0.` → `.iochiplet0.`) is a single-file edit (D5).
- The four-graph split of D2 is now fixed in an ADR, providing a clear criterion for deciding which graphs a new edge kind (e.g. a new kind for inter-die UCIe links) belongs to.
- The cube-local vs cross-cube branch of D3.1 (ADR-0017 D7) is documented, so anyone changing routing behavior knows which adjacency to touch.
- The consumer list of D6 makes the PR review scope clear when the helper API changes, and identifies the backward-compat shims (D3.6) as deprecation candidates.
|
||||
@@ -0,0 +1,352 @@
|
||||
# ADR-0052: OpLog + MemoryStore Schemas — sim_engine internals
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-22).
|
||||
|
||||
This ADR specifies the `OpRecord` schema in `sim_engine/op_log.py`, the record_start / record_end / record_copy behavior of `OpLogger`, and the (space, addr) address-space namespace and read/write semantics used by `MemoryStore` in `sim_engine/memory_store.py`. ADR-0020 (2-pass data execution) declares that both pieces of infrastructure exist, but **the exact fields of a record and their meaning** were never documented at the ADR level, even though several ADRs, including ADR-0046 D3.2 (`tl.store` visibility) and ADR-0023 D9 (IPCQ copy record), depend on their behavior.
|
||||
|
||||
## First action

### `OpLogger(memory_store=None)`

On construction it initializes three fields:

1. `self._records: list[OpRecord] = []`: the accumulated op records.
2. `self._pending: dict[int, dict] = {}`: partial records keyed by `id(msg)` (created at record_start, with record_end not yet received).
3. `self._memory_store = memory_store`: optional MemoryStore reference, used to capture input snapshots for math ops and HBM source snapshots for dma_write.

At construction both records and pending are empty; successive `record_*` calls accumulate data.

### `MemoryStore()`

On construction it initializes a single field, `self._storage: dict[str, dict[int, np.ndarray]] = {}`. It is a two-level dict (`space → addr → ndarray`), and inner dicts are filled lazily as each space is first needed.

In other words, **the first thing both pieces of infrastructure do is create an empty accumulation buffer and a per-space sparse dict**. Fields only start to fill once the first record or write actually arrives.
|
||||
|
||||
## Context
|
||||
|
||||
D2/D5/D7 of ADR-0020 (2-pass data execution) declare the following:

- During Phase 1 (timing), the `ComponentBase._on_process_start/end` hooks call `OpLogger.record_start/end` to record the time and metadata of every data op.
- Phase 2 (data) replays the op_log in t_start order to compute the actual data results.
- The data payload itself is kept in `MemoryStore` under a (space, addr) key.

Later ADRs such as ADR-0023 D9 (IPCQ atomic write), ADR-0027 (avoiding Megatron TP scratch overwrite), and ADR-0046 D3.2 (`tl.store` visibility) depend on the behavior of op_log and MemoryStore, yet the **exact record fields, space names, and snapshot timing** can only be confirmed by grepping the code. This ADR documents them.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. `OpRecord` schema: 7 fields

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component_id: str
    op_kind: str            # "memory" | "gemm" | "math" | "unknown"
    op_name: str            # e.g. "dma_read", "gemm_f16", "exp",
                            # "TileToken/DMA_READ", "composite_gemm",
                            # "ipcq_copy"
    params: dict[str, Any]
    dependency_ids: list[int] = field(default_factory=list)
```

- **`t_start` / `t_end`**: SimPy time (float ns). `t_start` is when the component started the op and `t_end` is when it completed; duration = `t_end - t_start`.
- **`component_id`**: id of the node where the op occurred (e.g. `"sip0.cube0.pe0.pe_dma"`).
- **`op_kind`**: one of four values. The Phase 2 DataExecutor branches on it.
- **`op_name`**: human-friendly name for debugging and analysis. For a TileToken, stages are distinguished as `"TileToken/{stage_type}"` (e.g. `"TileToken/DMA_READ"`).
- **`params`**: op-dependent metadata dict (see D3).
- **`dependency_ids`**: currently unused (default `[]`). Reserved for when cross-op dependency tracking becomes necessary.
|
||||
|
||||
### D2. `OpLogger.records`: guaranteed t_start ordering

```python
@property
def records(self) -> list[OpRecord]:
    self._records.sort(key=lambda r: r.t_start)
    return self._records
```

Every access performs a stable sort by `t_start`, so records with the same t_start keep their insertion order. This matches the "t_start stable ordering" requirement of ADR-0020 D5.

The Phase 2 DataExecutor always goes through the `records` property, so even if record_end calls arrive in an order different from t_start (e.g. a short op starts later than a long op but finishes first), the records are re-sorted and it sees a consistent sequence.
|
||||
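The interplay of `_pending`, `record_end`, and the sorted `records` property can be sketched with a stripped-down logger. `MiniOpLogger` and its reduced record type are hypothetical; the method signatures are simplified for the example and do not mirror the real `OpLogger`.

```python
from dataclasses import dataclass, field

@dataclass
class MiniRecord:  # reduced stand-in for OpRecord
    t_start: float
    t_end: float
    component_id: str
    op_name: str
    params: dict = field(default_factory=dict)

class MiniOpLogger:
    def __init__(self):
        self._records = []
        self._pending = {}   # id(msg) -> partial record

    def record_start(self, msg, component_id, now):
        self._pending[id(msg)] = {"t_start": now, "component_id": component_id}

    def record_end(self, msg, op_name, now):
        partial = self._pending.pop(id(msg))
        self._records.append(
            MiniRecord(partial["t_start"], now, partial["component_id"], op_name)
        )

    @property
    def records(self):
        self._records.sort(key=lambda r: r.t_start)  # list.sort is stable
        return self._records

log = MiniOpLogger()
long_msg, short_msg, tie_msg = object(), object(), object()
log.record_start(long_msg, "pe_gemm", 0.0)
log.record_start(short_msg, "pe_dma", 2.0)
log.record_start(tie_msg, "pe_math", 2.0)
# Completion order differs from start order.
log.record_end(tie_msg, "exp", 2.5)
log.record_end(short_msg, "dma_read", 3.0)
log.record_end(long_msg, "gemm_f16", 10.0)
ordered = [r.op_name for r in log.records]
```

Records are appended in completion order, but reading through the property returns them by start time; the two ops that both start at 2.0 keep the order in which they were appended.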
|
||||
### D3. `params` schema per op_name (`_extract_op_info` mapping)

#### D3.1. `op_kind="memory", op_name="dma_read"` (DmaReadCmd)

```python
{"src_addr": int, "nbytes": int, "handle_id": str}
```

#### D3.2. `op_kind="memory", op_name="dma_write"` (DmaWriteCmd)

```python
{
    "src_space": str,       # handle.space ("tcm"|"hbm"|"sram"), default "tcm"
    "src_addr": int,        # handle.addr
    "shape": tuple, "dtype": str,
    "dst_space": "hbm",     # DmaWrite always targets HBM
    "dst_addr": int,
    "nbytes": int,
    "handle_id": str,
    # at record_end, a snapshot is added if src_space == "hbm" (D4)
    "snapshot": np.ndarray | None,
}
```

#### D3.3. `op_kind="gemm", op_name=f"gemm_{dtype_a}"` (GemmCmd)

```python
{
    "src_a_addr": int, "src_b_addr": int, "dst_addr": int,
    "shape_a": tuple, "shape_b": tuple, "shape_out": tuple,
    "dtype_in": str, "dtype_out": str,
    "m": int, "k": int, "n": int,
    # ADR-0027: per-operand and output spaces are preserved
    "src_a_space": str, "src_b_space": str, "dst_space": str,
}
```
|
||||
|
||||
#### D3.4. `op_kind="math", op_name=msg.op` (MathCmd; op = "exp", "sum", "add", "where", etc.)

```python
{
    "input_addrs": list[int],       # addrs of the input handles
    "input_shapes": list[tuple],
    "input_spaces": list[str],
    "input_dtypes": list[str],
    "dst_addr": int, "dst_space": str,
    "shape_out": tuple, "dtype": str,
    "axis": int | None,             # meaningful only for reductions
    # at record_end, snapshots of all inputs are filled in (D4)
    "input_snapshots": list[np.ndarray | None],
}
```

#### D3.5. `op_kind="gemm" or "math", op_name=f"composite_{op}"` (CompositeCmd)

```python
{
    "op": str,                      # "gemm" | "math"
    "out_addr": int, "out_nbytes": int,
    # when op == "gemm", the same fields as GemmCmd are added:
    "src_a_addr": int, "src_b_addr": int,
    "shape_a": tuple, "shape_b": tuple,
    "dtype_in": str, "dtype_out": str,
    "src_a_space": str, "src_b_space": str,
    "dst_space": "hbm", "dst_addr": int,   # = out_addr
}
```

If `op == "gemm"` then `op_kind = "gemm"`, otherwise `"math"`. It is aliased so that Phase 2 replays it along the same path as GemmCmd.

#### D3.6. `op_kind="memory", op_name="ipcq_copy"` (record_copy-only path)

```python
{
    "src_space": str, "src_addr": int,
    "dst_space": str, "dst_addr": int,
    "shape": tuple, "dtype": str, "nbytes": int,
    "snapshot": np.ndarray | None,  # passed by the caller; if absent, record_copy does a fresh read
}
```

`PE_DMA._handle_ipcq_inbound` (ADR-0023 D9) emits this record so that Phase 2 can replay the inbound copy of an IPCQ slot. The record bypasses `record_start` / `record_end` and is pushed directly via `record_copy()`.

#### D3.7. `op_kind="unknown", op_name=type(msg).__name__`

Fallback for messages that `_extract_op_info` does not recognize; params = `{}`. The DataExecutor skips this op_kind, so it has no effect on Phase 2 replay.
|
||||
|
||||
### D4. When snapshots are captured

When `OpLogger._memory_store` is set, record_end does the following:

- **math op**: calls `self._memory_store.read(...)` with each input's addr/shape/space/dtype and attaches an ndarray copy to `params["input_snapshots"]`; `None` if the read fails.
- **dma_write op**: attaches a snapshot of the source HBM to `params["snapshot"]` only when `src_space == "hbm"`. TCM sources are **explicitly skipped**: TCM (PE scratch) is refilled by Phase 2 math/gemm replay, so a Phase-1-time snapshot risks capturing stale data from a previous kernel (ADR-0027 postmortem: TP gemm → all_reduce race).
- **ipcq_copy**: the `record_copy` caller passes an in-flight snapshot such as `snapshot=token.data`. If none is given, record_copy tries a fresh read instead.

Snapshots are taken with `.copy()` (`ndarray.copy()` is a fresh allocation), so they are safe from later storage mutation. This is the basis for avoiding the "cross-PE Phase 2 ordering" race of ADR-0027.

When `memory_store` is None (Phase 1 timing-only mode) the snapshot step is skipped entirely. Only timing information is kept in the records, and data replay is not possible.
|
||||
|
||||
### D5. TileToken handling: record_start captures the stage information

For the self-routing tile token of ADR-0014 D6 (pipeline mode), stage_idx may already have advanced by the time record_end runs (the TileToken caches the next stage's params as it moves to the next component). Therefore:

`record_start` saves the following into `pending[id(msg)]["snap"]` ahead of time:

```python
snap["stage_type"] = stage.stage_type.name   # "DMA_READ", "GEMM", etc.
snap["stage_params"] = dict(stage.params)    # copy of the params at this point
```

`record_end` takes this snap and merges it into params:

- `params["stage_type"]` is added to the final params.
- The keys of `stage_params` are merged in (keys already present are preserved).
- If `op_name == "TileToken"`, it is rewritten to `op_name = f"TileToken/{stage_type}"` (e.g. `"TileToken/DMA_READ"`), disambiguating records of different stages that occur on the same component.

Thanks to this mechanism, DMA_READ vs DMA_WRITE and FETCH vs STORE can be told apart in reporting even when they occur on the same component (e.g. pe_dma).
|
||||
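The merge step at record_end might look roughly like the following. `merge_stage_snapshot` is a hypothetical helper; in particular, using `setdefault` to keep keys that are already present is one reading of the merge rule above, not confirmed behavior.

```python
def merge_stage_snapshot(op_name: str, params: dict, snap: dict):
    """Fold the record_start snapshot into the final (op_name, params) pair."""
    merged = dict(params)
    stage_type = snap["stage_type"]
    merged["stage_type"] = stage_type
    for key, value in snap["stage_params"].items():
        merged.setdefault(key, value)          # keys already in params win
    if op_name == "TileToken":
        op_name = f"TileToken/{stage_type}"    # disambiguate stages on one component
    return op_name, merged

snap = {"stage_type": "DMA_READ", "stage_params": {"nbytes": 128, "src_addr": 4096}}
name, final_params = merge_stage_snapshot("TileToken", {"nbytes": 64}, snap)
```

Because the snapshot was taken at record_start, the stage name in the record reflects the stage that actually ran on this component, even if the token has since advanced.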
|
||||
### D6. `MemoryStore`: a two-level (space, addr) dict

```python
import numpy as np

class MemoryStore:
    def __init__(self) -> None:
        self._storage: dict[str, dict[int, np.ndarray]] = {}

    # setdefault creates the inner dict lazily on the first write to a space
    def write(self, space, addr, data): self._storage.setdefault(space, {})[addr] = data
    def read(self, space, addr, shape=None, dtype=None) -> np.ndarray: ...
    def has(self, space, addr) -> bool: ...
    def snapshot(self) -> "MemoryStore": ...
```

#### D6.1. space namespace

String keys. Standard values:

- `"hbm"`: HBM data (deploy_tensor plus Phase 2 dma_write results).
- `"tcm"`: PE-local TCM (Phase 2 math/gemm results).
- `"sram"`: cube-level SRAM (the IPCQ slot tier of ADR-0023 D9.7).

Other spaces (e.g. `"reg"`) are freely allowed: `_storage` is a lazy dict, so a new space is created automatically on the first write to it.
|
||||
|
||||
#### D6.2. address keying

`addr` is an integer and may be a **physical address (PA) or a virtual address (VA)**: MemoryStore itself knows nothing about the address space and just uses the value as a key. Phase 1's `MemoryWriteMsg` writes under both PA and VA (`_create_tensor` zero-initializes at the PA and also at the VA base), and Phase 2 reads and writes at the addresses captured in the op_log.

The meaning of `addr` is decided by the caller; `MemoryStore` only provides lookup.

#### D6.3. read/write semantics: reference store (no copy)

`write(space, addr, data)`: stores a reference to the `data` ndarray. **It does not copy**. If the caller later mutates the same ndarray, the stored value changes too.

`read(space, addr, shape=None, dtype=None)`: returns a reference to the stored ndarray. When `shape` or `dtype` is given:

- `dtype != stored.dtype`: reinterpret cast via `arr.view(np_dtype)` (no copy).
- `shape != stored.shape`: `arr.reshape(shape)` (a view) when `nbytes` matches.
- `nbytes` mismatch: `ValueError`.

To isolate data safely, the caller must call `arr.copy()`. This is why the race avoidance of ADR-0027 enforces an explicit copy at the op_log snapshot step.
|
||||
|
||||
#### D6.4. `has(space, addr) -> bool`

Only checks whether the key exists; it does not instantiate any data.

#### D6.5. `snapshot() -> MemoryStore`

Shallow copy. New inner dict instances are created, but the ndarray references are shared. Used at Phase 2 initialization to fork the Phase 1 store so that Phase 2 mutations do not affect other users of the Phase 1 store (writes rebind keys in the fork only; an array mutated in place is still shared).
|
||||
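The reference-store and shallow-snapshot semantics can be demonstrated with a dependency-free sketch. `MiniMemoryStore` is hypothetical, and plain Python lists stand in for `np.ndarray` payloads, so the dtype/shape view logic of `read` is intentionally left out.

```python
class MiniMemoryStore:
    def __init__(self):
        self._storage = {}   # space -> addr -> payload

    def write(self, space, addr, data):
        # Stores the reference itself; the inner dict is created lazily.
        self._storage.setdefault(space, {})[addr] = data

    def read(self, space, addr):
        return self._storage[space][addr]   # returns the same object, no copy

    def has(self, space, addr):
        return addr in self._storage.get(space, {})

    def snapshot(self):
        fork = MiniMemoryStore()
        # New inner dicts, shared payload references (shallow copy).
        fork._storage = {space: dict(inner) for space, inner in self._storage.items()}
        return fork

store = MiniMemoryStore()
payload = [1, 2, 3]
store.write("hbm", 0x1000, payload)
payload[0] = 99                          # caller mutation is visible through the store
fork = store.snapshot()
fork.write("tcm", 0x0, [7])              # new key exists only in the fork
safe = list(store.read("hbm", 0x1000))   # explicit copy decouples the data
payload[1] = -1                          # affects store and fork, but not `safe`
```

The explicit copy in `safe` plays the role that `.copy()` plays for the op_log snapshots: it is the only value here that later mutation cannot reach.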
|
||||
### D7. op_log assumes single-threaded SimPy

`OpLogger`'s `_records` and `_pending` are used without locks, on the assumption that SimPy is single-threaded and no other thread can interleave between `record_start` and `record_end`.

If multi-process kernbench (ADR-0047 D6) is introduced later, OpLogger must be split per process. A single OpLogger instance cannot receive records from multiple processes.
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Keep the op_log in an external store such as SQLite or parquet

Rejected (for now). An in-memory list minimizes the Phase 1 → Phase 2 hand-off latency. Externalizing could make sense for long-running batch runs, but for the current single-run workloads it only adds overhead.

### A2. Capture snapshots at record_start

Rejected. At record_start the inputs are often not yet populated (e.g. when a math op's input is the output of the immediately preceding op). record_end is the correct point.

### A3. Split MemoryStore into per-component stores

Rejected. The (space, addr) key already disambiguates sufficiently, and a per-component split would complicate cases such as cross-PE IPCQ copy (ADR-0023 D9), which must access both the source and destination stores.

### A4. Record explicit cross-op dependency edges in the op_log

Partially adopted. The `dependency_ids` field exists on OpRecord but is currently unused (D1). The Phase 2 DataExecutor determines ordering by t_start sort plus a secondary sort (memory ops before math at the same t_start); if an explicit dependency graph becomes necessary, this field is where it will go. The ordering rule is sufficient for now, so the field stays unused.
|
||||
|
||||
## Consequences
|
||||
|
||||
- The op_log / MemoryStore declarations of ADR-0020 are expanded into the concrete schemas of D1–D6, so the exact field semantics can be checked in the ADR, without grep, when writing or modifying the Phase 2 DataExecutor.
- With the per-op_name params schema of D3 documented, it is clear where in `_extract_op_info` to add a branch for a new op (e.g. a new reduction type).
- The difference in snapshot timing in D4 (math = input snapshot, dma_write = HBM-only snapshot) is fixed in an ADR, so the cross-PE race avoidance decided in ADR-0027 will not be silently broken by a future refactor.
- The reference-store semantics of D6.3 are explicit, so callers are aware of their responsibility for mutation safety. This justifies the explicit `.copy()` pattern of ADR-0027.
- The single-thread assumption of D7 is explicit, making it clear that OpLogger must be split when multi-process kernbench (a candidate for superseding ADR-0047 D6) is introduced.
|
||||
@@ -0,0 +1,307 @@
|
||||
# ADR-0053: Topology Builder + Visualizer Algorithms
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-22).
|
||||
|
||||
This ADR specifies the key algorithm choices (placement-driven router attachment, mesh auto-layout, source_hash caching, view projection, SVG rendering) in the topology compile-and-visualize pipeline implemented jointly by `topology/builder.py`, `topology/mesh_gen.py`, and `topology/visualizer.py`. ADR-0006 defines the high-level intent of topology compilation (compiled topology, distance extraction, automatic diagram generation), but **which algorithms the builder actually uses** could only be determined by grepping the code.
|
||||
|
||||
## First action

When `resolve_topology(path_str)` is called, four steps run in order:

1. **Path validation** (`builder.py::resolve_topology`): `Path(path_str).expanduser().resolve()`, then an existence check and an is-file check. On failure, `FileNotFoundError` or `ValueError`.
2. **YAML parsing** (`_read_spec`): `yaml.safe_load`. A parse error becomes a `ValueError` that includes line/column information. Anything other than a dict is rejected.
3. **Mesh auto-generation** (`mesh_gen.ensure_mesh_file`): creates `cube_mesh.yaml` in the same directory as the topology yaml (when the cache is invalid) or reuses it (on a cache hit). This step determines the cube NoC router grid and the attachment information.
4. **Graph compilation** (`_compile_graph`): accumulates nodes/edges in the order system → IO chiplets → cubes → inter-cube edges → IO↔cube edges → system↔IO edges, then generates four view projections (system, sip, cube, pe) and bundles everything into a `TopologyGraph`.

In other words, **the first thing topology compile does is read topology.yaml into a dict, create or validate cube_mesh.yaml in the same directory, and then build the flat graph plus 4-view projection in system→sip→cube→pe order**.
|
||||
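Step 1 (path validation) can be sketched with the standard library alone. `validate_topology_path` is a hypothetical name, and the mapping of "missing" to `FileNotFoundError` and "not a file" to `ValueError` is an assumption; the ADR only says that one of the two is raised.

```python
from pathlib import Path

def validate_topology_path(path_str: str) -> Path:
    """Normalize the path, then require that it exists and is a regular file."""
    path = Path(path_str).expanduser().resolve()
    if not path.exists():
        raise FileNotFoundError(f"topology file not found: {path}")
    if not path.is_file():
        raise ValueError(f"topology path is not a file: {path}")
    return path
```

Resolving the path before any further work means the later steps, such as placing `cube_mesh.yaml` next to the topology file, can rely on an absolute, existing location.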
|
||||
## Context
|
||||
|
||||
Responsibilities of the `topology/` package:

- **builder.py** (1207 lines): takes topology.yaml and compiles a `TopologyGraph` (nodes + edges + 4 view projections).
- **mesh_gen.py** (305 lines): automatically determines the cube NoC router grid and the PE/UCIe/M_CPU/SRAM attachment positions, and caches them as `cube_mesh.yaml`.
- **visualizer.py** (887 lines): generates four SVG diagrams (system / sip / cube / pe) from the `TopologyGraph`.

ADR-0006 defines the high-level decision that "the result of topology compilation is the single source for distance metadata and diagram generation", but the concrete algorithms (e.g. placement-driven nearest-router attachment, computing the HBM exclusion zone, which fields of source_hash trigger invalidation) are not in any ADR.

In particular, the following decisions are missing at the ADR level:

- Why is the mesh_gen output cached in a separate file (`cube_mesh.yaml`)?
- Which fields does source_hash include, and which changes force regeneration?
- Why are placement coordinates in mm rather than cube coordinates?
- How are HBM zone exclusion and the UCIe N/S/E/W distribution decided within the mesh?
- How do the abstraction levels of the four view projections (system/sip/cube/pe) differ?

This ADR collects these decisions in one place.
|
||||
|
||||
## Decision

### D1. Compile pipeline: six stages

`_compile_graph(spec)`:

1. **System node creation** (`_instantiate_system`): adds system-level nodes
   such as `fabric.switch0` and the host CPU.
2. **Per-SIP loop** (`for sip_id in range(system.sips.count)`):
   - **IO chiplets** (`_instantiate_io_chiplets`): creates the pcie_ep /
     io_cpu / io_noc / io_ucie PHY / conn nodes and their internal
     bidirectional edges.
   - **Cube instantiation** (`_instantiate_cube`): based on the router grid in
     cube_mesh.yaml, instantiates the per-cube routers, the PE sub-components
     (pe_cpu, pe_dma, pe_fetch_store, pe_gemm, pe_math, pe_mmu, pe_tcm,
     pe_scheduler, pe_ipcq), m_cpu, sram, and hbm_ctrl, and lays down the
     internal edges.
   - **Inter-cube edges** (`_add_inter_cube_edges`): UCIe N/S/E/W mesh edges.
   - **IO ↔ cube edges** (`_add_io_to_cube_edges`): connects io_noc to the
     cube's edge UCIe PHYs.
3. **Switch ↔ IO edges** (`_add_system_to_io_edges`): bidirectional edges
   between `fabric.switch0` and each SIP's `pcie_ep` (ADR-0038 D3 + the
   cross-SIP IPCQ path of ADR-0010).
4. **View projections**: builds four views:
   - `_build_system_view(spec)`: tray level, the SIPs and the system switch.
   - `_build_sip_view(spec)`: the cube mesh + IO chiplet inside a SIP.
   - `_build_cube_view(spec)`: the router grid + PE/M_CPU/SRAM/HBM_CTRL
     attachments inside a single cube.
   - `_build_pe_view(spec)`: the 9 sub-components + internal edges inside a
     single PE.
5. **Return the TopologyGraph**: `TopologyGraph(spec, nodes, edges,
   system_view, sip_view, cube_view, pe_view)`.

The six construction stages (system nodes, IO chiplets, cubes, inter-cube
edges, IO ↔ cube edges, switch ↔ IO edges) are **order-sensitive**: cubes must
exist before inter-cube edges can have a valid src/dst, and the IO chiplets
must be created first so that the IO ↔ cube edges can reference them. When
adding a new node kind, insert it at the position its dependencies dictate.

### D2. `cube_mesh.yaml`: separate file + source_hash cache

`mesh_gen.ensure_mesh_file(cube_spec, mesh_path)`:

1. Computes `source_hash = _compute_source_hash(cube_spec)`. Input fields:
   - `geometry` (cube_mm.w/h, etc.).
   - `pe_layout` (corners, pe_per_corner).
   - `ucie.n_connections`.
   - `memory_map.hbm_mapping_mode`.
   - `placement` (m_cpu/sram pos_mm).
2. If `mesh_path` (= `cube_mesh.yaml` in the same directory as
   `topology.yaml`) exists and `existing.source_hash == source_hash`, the file
   is reused (cache hit).
3. Otherwise a new mesh is generated with
   `_generate_mesh(cube_spec, source_hash)` and saved as YAML.

Reasons for caching in a separate file:

- Mesh generation involves the PE/UCIe/router attachment computation and is
  too heavy to redo on every compile.
- Repeated runs with the same cube spec must be guaranteed the same mesh.
- It yields an artifact that a person can inspect / debug directly.

The five fields listed for `source_hash` are the ones that determine the mesh
shape; changes to anything else (e.g. bandwidth, overhead_ns) do not trigger
mesh regeneration.

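
The cache check in D2 can be illustrated with a short sketch. The hash function (SHA-256 over canonical JSON), the dotted-path lookup, and the names `compute_source_hash` / `needs_regeneration` are assumptions for illustration; only the five input fields come from the text above.

```python
import hashlib
import json
from typing import Optional

# The five hash inputs named in D2, written as dotted paths into cube_spec.
HASH_FIELDS = (
    "geometry",
    "pe_layout",
    "ucie.n_connections",
    "memory_map.hbm_mapping_mode",
    "placement",
)


def _get(spec: dict, dotted: str):
    """Walk a dotted path; a missing key yields None."""
    node = spec
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def compute_source_hash(cube_spec: dict) -> str:
    """Assumed scheme: SHA-256 over a canonical JSON dump of the five fields."""
    payload = {field: _get(cube_spec, field) for field in HASH_FIELDS}
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def needs_regeneration(existing_mesh: Optional[dict], cube_spec: dict) -> bool:
    """Cache miss when no mesh exists or its stored source_hash differs."""
    if existing_mesh is None:
        return True
    return existing_mesh.get("source_hash") != compute_source_hash(cube_spec)
```

Because only the listed fields enter the payload, editing a bandwidth value leaves the hash unchanged, which is the behavior D2 describes.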
### D3. Cube NoC mesh auto-layout algorithm

`_generate_mesh(cube_spec, source_hash)`:

#### D3.1. Determining rows and columns

- `pe_positions = _corner_pe_positions(cube_w, cube_h)`: the PE center
  coordinates (mm) for each of the 4 corners (NW/NE/SW/SE). They follow a
  hardcoded `(1.5, 1.5)` / `(cube_w-1.5, cube_h-1.5)` pattern; with
  `pe_per_corner=2`, each corner gets 2 PE positions.
- `col_xs = _compute_col_positions(...)`: the union of the PE x coordinates,
  with relay columns inserted into any gap larger than
  `max_spacing = 3.0 mm`.
- `row_ys, rows_per_half = _compute_row_positions(cube_h, n_connections,
  pe_positions)`:
  - `n_conn = max(n_connections, 2)` (hot path minimum).
  - `rows_per_half = ceil(n_conn / 2)`.
  - Top half + two HBM rows + bottom half. The HBM sits at
    `(cube_h/2 - 1.5, cube_h/2 + 1.5)`. The gap between the PE rows and the
    HBM rows is `hbm_gap = 1.5 mm`.

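
The relay-column rule in D3.1 can be sketched as below. The text only states that relay columns are inserted into gaps larger than `max_spacing`; spacing the relays evenly inside a gap is an assumption of this sketch, and `compute_col_positions` is a hypothetical name rather than the signature of `_compute_col_positions`.

```python
import math


def compute_col_positions(pe_xs, max_spacing=3.0):
    """Union of PE x coordinates plus relay columns in oversized gaps.

    Assumption: relays are spaced evenly inside a gap; the real
    implementation may place them differently.
    """
    xs = sorted(set(pe_xs))
    if not xs:
        return []
    cols = [xs[0]]
    for left, right in zip(xs, xs[1:]):
        gap = right - left
        if gap > max_spacing:
            # Split the gap into the fewest equal segments that fit max_spacing.
            n_segments = math.ceil(gap / max_spacing)
            step = gap / n_segments
            cols.extend(left + step * i for i in range(1, n_segments))
        cols.append(right)
    return cols
```

With PE columns at 1.5 mm and 10.5 mm, the 9 mm gap is split into three 3 mm segments, so two relay columns appear between the PE columns.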
#### D3.2. HBM exclusion zone

`hbm_row_start = rows_per_half`, `hbm_row_end = rows_per_half + 1`.
`hbm_col_start = n_cols // 2 - 1`, `hbm_col_end = n_cols // 2`.

Router slots inside this (row, col) rectangle are marked `None` (no router).
The actual HBM controllers are attached as separate `hbm_ctrl.pe{X}` nodes,
following the per-PE partition pattern of ADR-0017 D9.

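
A small sketch of the exclusion zone, assuming both bounds are inclusive (consistent with the "two HBM rows" in D3.1) and assuming the grid has `2 * rows_per_half + 2` rows. The function name and the `r{R}c{C}` slot labels are illustrative only.

```python
def build_router_grid(rows_per_half, n_cols):
    """Return a row-major grid of router labels with None in the HBM zone."""
    n_rows = 2 * rows_per_half + 2  # top half + two HBM rows + bottom half
    hbm_row_start, hbm_row_end = rows_per_half, rows_per_half + 1
    hbm_col_start, hbm_col_end = n_cols // 2 - 1, n_cols // 2
    grid = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            in_hbm = (hbm_row_start <= r <= hbm_row_end
                      and hbm_col_start <= c <= hbm_col_end)
            # Slots inside the HBM rectangle hold no router.
            row.append(None if in_hbm else f"r{r}c{c}")
        grid.append(row)
    return grid
```

For `rows_per_half=2` and `n_cols=4` this leaves a 2×2 hole in the middle of a 6×4 grid.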
#### D3.3. PE attachment

The PEs of each corner map to the following rows:

- Top half: NW → row 0, NE → row 1 (the index within top_corners).
- Bottom half: SW → row `hbm_row_end + 1`, SE → row `hbm_row_end + 2`.

Each PE attaches to the router in the column nearest to its x coordinate
(`min(range(n_cols), key=lambda c: abs(col_xs[c] - pe_x))`). Three entries are
attached, `pe{pe_idx}.dma`, `pe{pe_idx}.cpu`, and `pe{pe_idx}.hbm`, each
pushed onto the per-router attach list.

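
The nearest-column rule quoted above can be exercised in isolation. In this sketch the attach list is modeled as a dict keyed by `(row, col)`, which is an assumption about the data structure; the `min(...)` expression is the one given in D3.3.

```python
from collections import defaultdict


def attach_pe(attach, col_xs, row, pe_idx, pe_x):
    """Attach the three PE ports to the router in the nearest column."""
    n_cols = len(col_xs)
    col = min(range(n_cols), key=lambda c: abs(col_xs[c] - pe_x))
    for port in ("dma", "cpu", "hbm"):
        attach[(row, col)].append(f"pe{pe_idx}.{port}")
    return col


attach = defaultdict(list)
attach_pe(attach, [1.5, 4.5, 7.5, 10.5], row=0, pe_idx=0, pe_x=1.5)
```

Note that `min` returns the first minimum, so a PE exactly halfway between two columns attaches to the left one.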
#### D3.4. M_CPU / SRAM attachment: nearest router by Euclidean distance

The router nearest, by Euclidean distance, to the coordinates
`placement.m_cpu.pos_mm` (default `[1.5, 5.5]`) and `placement.sram.pos_mm`
(default `[1.5, 8.5]`) is found, and `"m_cpu"` / `"sram"` is added to its
attach list.

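
A minimal sketch of the nearest-router search. Skipping `None` slots (the HBM exclusion zone) and breaking ties by scan order are assumptions of this sketch; the function name is hypothetical.

```python
import math


def nearest_router(grid, col_xs, row_ys, pos_mm):
    """Return the (row, col) of the router closest to pos_mm, or None."""
    x, y = pos_mm
    best, best_dist = None, math.inf
    for r, row in enumerate(grid):
        for c, slot in enumerate(row):
            if slot is None:
                continue  # no router in this slot (assumed: HBM zone)
            dist = math.hypot(col_xs[c] - x, row_ys[r] - y)
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best
```

Working in mm means the same `pos_mm` value can be used by both the layout step and the visualizer, which is the argument made in A3.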
#### D3.5. UCIe N/S/E/W distribution

`ucie_pe_rows = top_pe_rows + bot_pe_rows` (`2 * rows_per_half` rows in
total).

- UCIe-E: for every PE row, `ucie_e.c{i}` on the router in the rightmost
  column.
- UCIe-W: `ucie_w.c{i}` on the router in the leftmost column (the mirror of
  E).
- UCIe-N/S: the PE columns are split half to the left and half to the right,
  and attached to the corresponding columns of the top row / bottom row.

Each UCIe connection carries a `c{i}` index, so that ucie_n_connections PHYs
are distributed (ADR-0017 D5+).

### D4. Node naming convention: single owner

builder.py creates nodes under the following naming convention (the
single-owner principle of ADR-0051 D5):

- `fabric.switch0`: the system-level switch.
- `sip{S}.{io_id}.{pcie_ep|io_cpu|io_noc|io_ucie.{dir}|conn.{id}}`: IO
  chiplet.
- `sip{S}.cube{C}.{m_cpu|sram|hbm_ctrl.pe{X}|noc.r{R}c{C}|...}`: cube
  internals.
- `sip{S}.cube{C}.pe{P}.{pe_cpu|pe_dma|pe_fetch_store|pe_gemm|pe_math|pe_mmu|pe_tcm|pe_scheduler|pe_ipcq}`: PE sub-components.

Changing this convention requires updating builder.py and the helpers in
router.py (ADR-0051) together. Components do not know the naming convention
directly; they only call the helpers.

### D5. Edge `kind` classification

The `kind` assigned to each edge is the input to the routing policy (ADR-0051
D2). The main kind values:

- `"pe_internal"`: between sub-components inside a PE.
- `"pe_to_router"`: PE_DMA ↔ cube NoC router.
- `"router_mesh"`: between cube NoC routers.
- `"router_to_hbm"`, `"router_to_mcpu"`, `"router_to_sram"`,
  `"sram_to_router"`, etc.: between cube-attached components.
- `"ucie_internal"`, `"ucie_conn_to_router"`, `"router_to_ucie_conn"`,
  `"ucie_conn_to_noc"`, `"noc_to_ucie_conn"`, `"ucie_mesh"`: UCIe-related.
- `"io_internal"`: inside an IO chiplet.
- `"io_to_cube"`, `"cube_to_io"`: the IO ↔ cube boundary.
- `"pcie"`: switch ↔ pcie_ep.
- `"command"`: control-plane-only edges (M_CPU ↔ NOC, etc.; excluded from the
  PE DMA path).

When adding a new edge kind, decide which category of router.py's four
adjacency graphs (ADR-0051 D2) it belongs to. Otherwise it is included only in
`_adj_all` by default, which can produce routing that differs from the intent.

### D6. View projection: four abstraction levels

Besides the flat representation (nodes + edges), `TopologyGraph` holds four
view projections:

- **system_view** (`_build_system_view`): tray level. SIP boxes +
  `fabric.switch0`, with the PCIE links shown. A high-level overview for
  external presentations.
- **sip_view** (`_build_sip_view`): inside one SIP. The cube mesh + IO chiplet
  (pcie_ep + io_cpu + io_noc). UCIe N/S/E/W appear as inter-cube connections.
- **cube_view** (`_build_cube_view`): inside one cube. The router grid +
  PE/M_CPU/SRAM/HBM_CTRL attachments + the UCIe PHY edge portion. Used to
  diagnose intra-cube routing / placement.
- **pe_view** (`_build_pe_view`): inside one PE. The 9 sub-components +
  internal edges (pe_internal kind). Used to review detailed PE-internal
  dataflow.

Views are emitted selectively through the spec, e.g.
`visualization.emit_views: [system, sip, cube]` (ADR-0006). The pe view is
left out of the default output, but its code is retained for detailed
debugging.

### D7. visualizer.py: SVG diagram output

`emit_diagrams(graph, out_dir)` renders every view to SVG. Key functions:

- `_render_view_svg(view)`: generic view rendering (for views without a router
  grid).
- `_render_cube_view_svg(view, spec)`: cube-view specific (drawing the HBM
  block, router grid layout, PE/M_CPU/SRAM/HBM positioning).
- `_draw_node`, `_draw_edge`: the visual representation of nodes / edges.
- `_pick_scale`, `_compute_node_sizes`: automatic scaling.

The visualizer is classified as a **derived artifact** (ADR-0006) and is not
subject to production checks when its code changes. This is consistent with
the "Derived Artifacts" section of CLAUDE.md.

### D8. Impact scope of spec changes

| spec field                             | Effect                  | Mesh regeneration |
|----------------------------------------|-------------------------|-------------------|
| `system.sips.count`                    | SIP count, node count   | No                |
| `sip.cube_mesh.w/h`                    | cube mesh shape         | No                |
| `cube.geometry.cube_mm.w/h`            | cube size (mm)          | **Yes**           |
| `cube.pe_layout.corners/pe_per_corner` | PE attachment positions | **Yes**           |
| `cube.ucie.n_connections`              | UCIe PHY distribution   | **Yes**           |
| `cube.memory_map.hbm_mapping_mode`     | HBM distribution mode   | **Yes**           |
| `cube.placement`                       | M_CPU/SRAM positions    | **Yes**           |
| `cube.memory_map.*` (other than above) | HBM capacity / BW       | No                |
| `*.links.*.bw_gbs`                     | edge bandwidth          | No                |
| `*.attrs.overhead_ns`                  | component latency       | No                |

The table matches the `_compute_source_hash` inputs in D2. Changes that
require mesh regeneration automatically invalidate the source_hash stored in
`cube_mesh.yaml`.

## Alternatives Considered

### A1. Regenerate the mesh on every compile, without a separate cache file

Rejected. Every case that calls with the same spec repeatedly (CLI run, probe,
test) would pay the mesh generation cost again. The artifact that lets a
person inspect the mesh would also disappear.

### A2. Merge mesh generation into builder.py

Rejected (for now). It is a self-contained 305-line algorithm, and the mesh
layout decisions (placement-driven router attachment, HBM exclusion zone)
differ from the builder's general responsibility of creating nodes and edges.
Keeping them separate fits the single-responsibility principle better.

### A3. Express placement coordinates as cube coordinates (col/row)

Rejected. Coordinates in mm are used consistently on both the visualization
side (visualizer) and the mesh layout side (nearest-router computation). Cube
coordinates are undefined until the router grid has been determined, so they
are unsuitable as placement input.

### A4. Generate view projections lazily

Rejected (for now). The four views are cheap to generate (typically < 100 ms),
and eager generation guarantees a single source of truth through
`TopologyGraph`.

### A5. Support visualizer output formats other than SVG (PNG/PDF)

Rejected. SVG is the most flexible format: it is vector-based, text-searchable,
and directly renderable in a browser. If PNG conversion is needed, post-process
with a separate tool (rsvg-convert, etc.).

## Consequences

- ADR-0006's high-level intent is made concrete by D1–D7, so the impact of a
  topology change can be gauged quickly from the D8 table.
- The D3 mesh auto-layout algorithm is fixed at the ADR level, so when a new
  PE attachment pattern (e.g. a 6-zone HBM split) is introduced later, it is
  clear which stage is affected.
- The D5 edge kind list and the D6/D7 view structure are explicit, so a PR
  reviewer can see at a glance how far a new component kind reaches (builder +
  router + visualizer).
- The D2 source_hash invalidation rule is explicit, so it is clear that
  cube_mesh.yaml not being regenerated (e.g. when only a bw value changed) is
  normal behavior.
@@ -0,0 +1,138 @@
# ADR-0054: Milestone eval benches — self-contained sweep + figure benches

## Status

Accepted (2026-05-22).

Amends ADR-0044 (D1/D2) and ADR-0045 (D5), and supersedes the "logic lives in
`scripts/` + `tests/`" layout of ADR-0043/0044: the GEMM/allreduce evaluation
harnesses now become self-contained **benches** that a user runs to regenerate
all results + figures.

## Context

ADR-0043 (allreduce evaluation) and ADR-0044 (GEMM evaluation) split each
harness into a **sweep** (a manual `scripts/` driver or, for allreduce, the
parametrized tests themselves) plus **figure tests** that render committed
data. The sweep/render logic therefore lived in `scripts/gemm_sweep.py`,
`tests/gemm/_gemm_plot_helpers.py`, and `tests/sccl/_allreduce_helpers.py`.

The milestone requirement ("refactor the allreduce + GEMM evaluation so that a
user runs *one bench* to generate all results and plots") cannot be met with
that layout: a bench is production code and **cannot import `tests/`**
(ADR-0007 layer direction). The evaluation logic had to move into production
so that the bench can reach it.

The chosen home is the bench module itself, not a separate `kernbench.eval`
package. A bench file may contain arbitrary module-level code, and folding the
harness into the bench keeps one file per domain and removes one package
layer.

## Decision

### D1. The two milestone benches own the evaluation logic

- `src/kernbench/benches/milestone_1h_gemm.py`: the GEMM shape×variant sweep +
  three figure renderers (moved from `scripts/gemm_sweep.py` +
  `tests/gemm/_gemm_plot_helpers.py`).
- `src/kernbench/benches/milestone_1h_ccl.py`: the distributed allreduce
  driver, the latency + buffer-kind sweeps, the topology diagram, the FSIM
  comparison, and the direct-launch parity reference (moved from
  `tests/sccl/_allreduce_helpers.py`).

Each file is the **single home** of its domain's evaluation logic.

### D2. The "eval bench" pattern (extends ADR-0045 D5)

ADR-0045 D5 pinned a bench to a single configuration (single-SIP, or the
ADR-0024 multi-SIP CCL exception). This ADR adds a third pattern:

- An **eval bench** may drive *many* configurations and render figures.
  Instead of using the outer `run_bench` engine, it builds its own
  `GraphEngine` / `RuntimeContext` at each sweep point.
- No handle is then submitted to the outer ctx, so the bench finishes by
  submitting a **sentinel tensor** (`torch.zeros((1, 1), …)`) to satisfy
  `run_bench`'s "submit at least once" contract (ADR-0045 D4) and let the CLI
  exit with 0.

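
The sentinel idea can be shown with stand-ins. Everything in this sketch is hypothetical: `StubContext` only imitates the handle tracking of `RuntimeContext`, `run_point` stands in for a sweep point that builds its own engine, and `completion_for` mimics the NO_REQUESTS check that ADR-0045 D4 attributes to `run_bench`. The real `zeros` call takes further arguments that are elided in the text above.

```python
class StubContext:
    """Hypothetical stand-in for RuntimeContext; it only tracks handles."""

    def __init__(self):
        self._handles = []

    def zeros(self, shape):
        handle = ("zeros", shape)
        self._handles.append(handle)
        return handle

    def handles(self):
        return list(self._handles)


def run_point(config):
    """Placeholder for one sweep point that uses its own private context."""
    inner = StubContext()
    inner.zeros((config["m"], config["n"]))
    return len(inner.handles())


def run(torch):
    """Eval-bench shape: sweep on private contexts, then submit a sentinel."""
    results = [run_point(cfg) for cfg in ({"m": 2, "n": 2}, {"m": 4, "n": 4})]
    torch.zeros((1, 1))  # sentinel: the outer context now has one submission
    return results


def completion_for(ctx):
    """Mimics the rule that an empty handle list reports NO_REQUESTS."""
    return "OK" if ctx.handles() else "NO_REQUESTS"
```

Without the sentinel line, the outer context would finish with no handles even though every sweep point ran successfully.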
### D3. Output location

Both benches write to `src/kernbench/benches/1H_milestone_output/{gemm,ccl}/`
(per user request: artifacts next to the bench). The directories hold only
generated PNG/CSV/JSON (no `.py`/`__init__.py`), so the eager-import audit
(ADR-0045 First action) ignores them: `pkgutil.iter_modules` does not yield
non-package subdirectories. Like the `docs/diagrams/` artifacts, they are
**committed** (so that figures can be viewed on the remote); rerunning a bench
regenerates them in place.

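
The claim about `pkgutil.iter_modules` can be checked against a throwaway directory. The helper name and the file names used here are illustrative; the sketch only relies on the standard-library behavior that a subdirectory without `__init__.py` is not reported.

```python
import pkgutil
import tempfile
from pathlib import Path


def discovered_module_names(package_dir):
    """Names that pkgutil.iter_modules reports for one directory."""
    return sorted(info.name for info in pkgutil.iter_modules([str(package_dir)]))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "milestone_1h_gemm.py").write_text("")
    # An artifact directory with no __init__.py is not a package.
    out_dir = root / "1H_milestone_output" / "gemm"
    out_dir.mkdir(parents=True)
    (out_dir / "figure.png").write_bytes(b"")
    names = discovered_module_names(root)
```

After this runs, `names` contains the bench module but not the output directory, so an audit built on `iter_modules` never tries to import it.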
### D4. GEMM heavy sweep: fresh by default, reuse with `MILESTONE_FAST`

By default, `milestone-1h-gemm` runs the full 24-sim sweep (minutes; one shape
has 2048 tiles). `MILESTONE_FAST=1` reuses the committed
`docs/diagrams/gemm_sweep.json` and only renders (seconds). This reverses
ADR-0044 D1/D2's "keep the heavy sweep as a manual/`slow` step": running the
bench is the regeneration. The slow path is exercised by a
`@pytest.mark.slow` bench test, while the fast path is covered by the default
test run.

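
A sketch of the fast/fresh switch, under stated assumptions: the helper `load_or_run_sweep` is hypothetical, only the exact value `"1"` is treated as enabling the fast path, and a missing JSON file falls back to a fresh sweep. None of these details are confirmed by the text above.

```python
import json
import os
from pathlib import Path


def load_or_run_sweep(json_path, run_sweep, environ=None):
    """Reuse cached sweep results when MILESTONE_FAST=1, else run the sweep."""
    env = os.environ if environ is None else environ
    path = Path(json_path)
    # Assumption: the fast path needs both the flag and an existing file.
    if env.get("MILESTONE_FAST") == "1" and path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    results = run_sweep()
    path.write_text(json.dumps(results), encoding="utf-8")
    return results
```

Passing `environ` explicitly keeps the function testable without mutating the process environment.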
### D5. Tests + scripts reuse the benches through thin re-export shims (single home preserved)

The existing figure tests and the `scripts/gemm_sweep.py` entry point are kept
and now reuse the bench modules:

- `tests/gemm/_gemm_plot_helpers.py` → re-exports the renderers +
  `GEMM_SWEEP_JSON`/`GEMM_PLOTS_DIR`/`ROOT` from
  `kernbench.benches.milestone_1h_gemm`.
- `tests/sccl/_allreduce_helpers.py` → re-exports the driver core, config
  writer, sweep constants, renderers, and disk aggregator from
  `kernbench.benches.milestone_1h_ccl`, and keeps the **pytest-only** pieces
  local: the `pytest.param` matrices (`CONFIGS` / `_sweep_params` /
  `_bk_params`) and the fixture-coupled `_run_distributed` wrapper
  (`monkeypatch.chdir` + `_drive_distributed`).
- `scripts/gemm_sweep.py` → a thin wrapper over the bench's `run_sweep`.

Tests importing a bench module is allowed (tests sit above production,
ADR-0007). Doing so triggers the package-wide eager audit, which already runs
on every `kernbench` invocation. matplotlib remains a lazy import inside the
renderers, so the audit's startup cost is unchanged.

### D6. Flat module naming (no subfolder under `benches/`)

A `benches/` subpackage named `1H_milestone…` is not possible: a Python
package name cannot start with a digit. The benches are therefore the flat
modules `milestone_1h_gemm.py` / `milestone_1h_ccl.py`, and the bench names
are `milestone-1h-gemm` / `milestone-1h-ccl` (kebab-case, starting with a
letter as required by ADR-0045 D1).

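
The naming constraint is easy to confirm with `str.isidentifier`. Both helper functions below are hypothetical, and the underscore-to-dash mapping between module name and bench name is an observation about the two names in D6, not a rule stated by the ADR.

```python
def is_importable_name(dotted_name: str) -> bool:
    """True when every dotted component is a valid Python identifier."""
    return all(part.isidentifier() for part in dotted_name.split("."))


def bench_name_for(module_name: str) -> str:
    """Assumed mapping from a flat module name to its kebab-case bench name."""
    return module_name.replace("_", "-")


digit_first = is_importable_name("kernbench.benches.1H_milestone")
flat_module = is_importable_name("kernbench.benches.milestone_1h_gemm")
```

Here `digit_first` is `False` and `flat_module` is `True`, which is why the benches are flat modules whose names start with a letter.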
## Consequences

### Positive

- `kernbench run --bench milestone-1h-gemm` (or `…-ccl`) regenerates all of a
  domain's results + figures with one command, which is the milestone
  requirement.
- The evaluation logic has a single source (the bench), reused by tests and
  scripts through shims; there is no duplication.
- The figure tests and `scripts/gemm_sweep.py` keep working unchanged.

### Negative / limitations

- Both bench files are large (the CCL one mixes the distributed driver,
  sweeps, and matplotlib drawing). A "bench" that is mostly an evaluation
  harness is unusual, and this ADR justifies it.
- Generated artifacts live inside the source tree (`src/kernbench/benches/`)
  by explicit request and are committed (so that figures can be viewed on the
  remote); rerunning a bench regenerates them.
- `milestone-1h-ccl` (and the default `milestone-1h-gemm`) takes minutes. This
  is acceptable for on-demand milestone artifacts, but not for everyday runs.

## Dependencies

- **ADR-0007**: layer direction (why tests may import production but a bench
  may not import tests).
- **ADR-0043 / ADR-0044**: the allreduce / GEMM evaluation harnesses that this
  ADR relocates into benches.
- **ADR-0045**: the bench module contract; D2 here extends its D5
  (single-device rule) with the eval-bench pattern, and relies on its D4
  (NO_REQUESTS) for the sentinel.
- **ADR-0024**: the rank = SIP launcher that the allreduce sweep drives.

## Open questions

- Should the GEMM theoretical-model constants (ADR-0044 D5) be sourced from
  ADR-0033/0014 instead of copied? Unchanged by this ADR.
- Should `build_overview_slides.py` consume the milestone output PNGs instead
  of drawing the GEMM bars natively? Still open (ADR-0044 D6 / Negative).
@@ -0,0 +1,175 @@
# ADR Index

Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **47**.

Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.

## Design Principles

- [ADR-0013](./ADR-0013-ver-verification-strategy.md) — Verification strategy and Phase 1 test plan
- [ADR-0033](./ADR-0033-lat-latency-model-assumptions.md) — Latency model: assumptions and known simplifications

## High-level Architecture

- [ADR-0003](./ADR-0003-dev-target-system-hierarchy.md) — Target system hierarchy and modeling scope _(System hierarchy (Tray / SIP / CUBE / PE))_
- [ADR-0007](./ADR-0007-api-runtime-api-boundaries.md) — Runtime API and simulation engine boundaries _(Runtime API ↔ sim_engine boundaries)_
- [ADR-0016](./ADR-0016-dev-iochiplet-noc-and-memory-path.md) — IOChiplet NoC and memory data path _(IOChiplet NOC and memory data path)_
- [ADR-0017](./ADR-0017-dev-cube-noc-and-hbm-connectivity.md) — Cube NoC and HBM connectivity _(Cube NOC and HBM connectivity)_

## Detailed Architecture

One subsection per component file under `src/kernbench/components/builtin/`.

### forwarding

- [ADR-0037](./ADR-0037-dev-forwarding-component.md) — Forwarding component (forwarding_v1)

### hbm_ctrl

- [ADR-0034](./ADR-0034-dev-hbm-controller-internal-design.md) — HBM controller internal design

### io_cpu

- [ADR-0036](./ADR-0036-dev-io-cpu-component-model.md) — IO_CPU component model

### m_cpu

- [ADR-0035](./ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md) — M_CPU and M_CPU.DMA component model

### pcie_ep

- [ADR-0038](./ADR-0038-dev-pcie-ep-component-model.md) — PCIE_EP Component Model

### pe_cpu

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model

### pe_dma

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model
- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication

### pe_fetch_store

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model

### pe_gemm

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model

### pe_ipcq

- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication

### pe_math

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model

### pe_mmu

- [ADR-0039](./ADR-0039-dev-pe-mmu-component-model.md) — PE_MMU Component Model — dual role as component + utility

### pe_scheduler

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model

### pe_tcm

- [ADR-0040](./ADR-0040-dev-pe-tcm-component-model.md) — PE_TCM Component Model — dual-channel BW serialization

### sram

- [ADR-0041](./ADR-0041-dev-cube-sram-component-model.md) — Cube SRAM Component Model — terminal scratchpad on cube NoC

### tiling

- [ADR-0042](./ADR-0042-prog-tile-plan-generators.md) — Tile Plan Generators — GEMM/Math pipeline plan builders

## Implementation Decisions

### Address Scheme

- [ADR-0001](./ADR-0001-mem-physaddr-layout.md) — 51-bit physical address layout and decoding contract
- [ADR-0011](./ADR-0011-mem-memory-addressing-simplification.md) — Memory addressing — PA / VA / LA address model

### Routing & Helper API

- [ADR-0002](./ADR-0002-lat-routing-distance.md) — Routing distance, ordering, and detour rules
- [ADR-0051](./ADR-0051-lat-routing-helper-api.md) — Routing Helper API — `AddressResolver` + `PathRouter`

### Memory Semantics & Local-HBM Bandwidth

- [ADR-0004](./ADR-0004-mem-memory-semantics-local-hbm.md) — Memory semantics and local HBM bandwidth guarantee

### Topology Compilation, Diagrams & Builder Algorithms

- [ADR-0005](./ADR-0005-dev-diagram-views-distance-layout.md) — Diagram views and distance-based layout rules
- [ADR-0006](./ADR-0006-dev-topology-compilation-distance-diagram.md) — Topology compilation, distance extraction, and automatic diagram generation
- [ADR-0053](./ADR-0053-dev-topology-builder-algorithms.md) — Topology Builder + Visualizer Algorithms

### Tensor Deployment and Allocation

- [ADR-0008](./ADR-0008-api-tensor-deploy-and-allocation.md) — Tensor deployment and allocation (host allocator, PA-first)

### Kernel Execution and Host-Device Messaging

- [ADR-0009](./ADR-0009-api-kernel-execution-messaging.md) — Kernel execution messaging and completion semantics
- [ADR-0012](./ADR-0012-api-host-io-message-schema.md) — Host ↔ IO_CPU message schema (PA-first, PE-tagged)

### CLI Surface and Semantics

- [ADR-0010](./ADR-0010-api-cli-surface-and-semantics.md) — Command-line interface and execution semantics

### Component Port/Wire Fabric Model

- [ADR-0015](./ADR-0015-dev-component-port-wire-model.md) — Component port/wire model and fabric routing

### Two-Pass Data Execution

- [ADR-0020](./ADR-0020-prog-data-execution-two-pass.md) — 2-pass data execution model (timing / data separation)

### 2D Grid Program Identity

- [ADR-0022](./ADR-0022-prog-program-id-2d-grid.md) — 2D grid program_id semantics

### Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)

- [ADR-0024](./ADR-0024-par-sip-tp-launcher.md) — SIP-level Launcher — rank = SIP
- [ADR-0026](./ADR-0026-par-dppolicy-intra-device.md) — DPPolicy = Intra-Device Only — removal of the sip/num_sips fields
- [ADR-0027](./ADR-0027-par-megatron-tp.md) — Megatron-style Tensor Parallelism API
- [ADR-0047](./ADR-0047-par-ahbm-ccl-backend.md) — AHBM CCL Backend — `torch.distributed`-compat shim
- [ADR-0050](./ADR-0050-par-ccl-algorithm-module-contract.md) — CCL Algorithm Module Contract — `ccl/algorithms/*.py`

### IPCQ Direction Addressing

- [ADR-0025](./ADR-0025-algo-ipcq-direction-addressing.md) — IPCQ Direction Addressing — address-based matching

### Intercube All-Reduce

- [ADR-0032](./ADR-0032-algo-intercube-allreduce.md) — Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange

### Evaluation Harnesses

- [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce evaluation harness — `tests/sccl/`
- [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM evaluation harness — `scripts/gemm_sweep.py` + `tests/gemm/`
- [ADR-0054](./ADR-0054-eval-milestone-benches.md) — Milestone eval benches — self-contained sweep + figure benches

### Bench Module Contract

- [ADR-0045](./ADR-0045-prog-bench-module-contract.md) — Bench Module Contract — registration, dispatch, and authoring

### Kernel-side tl.* API (TLContext)

- [ADR-0046](./ADR-0046-prog-tl-context-contract.md) — TLContext — Kernel-side `tl.*` API Contract

### Memory Allocator Algorithms

- [ADR-0048](./ADR-0048-mem-allocator-algorithms.md) — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator

### Probe Subcommand

- [ADR-0049](./ADR-0049-ver-probe-subcommand.md) — `kernbench probe` Subcommand — Traffic-Pattern Verification Harness

### Sim-engine Op Log and Memory Store Schemas

- [ADR-0052](./ADR-0052-dev-oplog-memory-store-schemas.md) — OpLog + MemoryStore Schemas — sim_engine internals
@@ -7,6 +7,11 @@ Accepted
Documents the `tests/sccl/` evaluation harness; verified against the
implementation (constants, file set, and sweep dimensions cross-checked).

**Amended by ADR-0054**: the driver core, sweeps, and renderers moved into
the `milestone-1h-ccl` bench (single home); `tests/sccl/_allreduce_helpers.py`
now re-exports from it (keeping the pytest-only param builders +
`_run_distributed` wrapper local). The figure tests are unchanged.

## Context

ADR-0032 defines the intercube all-reduce *algorithm*; ADR-0023/0024/0027
@@ -9,6 +9,12 @@ implementation (constants, tile sizes, figure set, and the script↔test
split cross-checked). The D5/D6 caveats are recorded limitations, not
inaccuracies.

**Amended by ADR-0054**: the sweep + renderers moved into the
`milestone-1h-gemm` bench (single home); `scripts/gemm_sweep.py` and
`tests/gemm/` now re-export from it. D1/D2's "data generation stays a manual
script / heavy work is opt-in" is superseded by the eval-bench pattern (one
bench regenerates everything; `MILESTONE_FAST=1` reuses the committed JSON).

## Context

ADR-0014 (PE pipeline) and ADR-0042 (tile-plan generators) define the GEMM
@@ -0,0 +1,295 @@
# ADR-0045: Bench Module Contract — registration, dispatch, and authoring

## Status

Accepted (2026-05-21).

Unifies the `src/kernbench/benches/` registration mechanism (@bench), the
CLI dispatch path (`kernbench run/list`), and the contract a new bench
module must follow. ADR-0010 (CLI surface) specifies the `kernbench
list/run` interface, but **how benches are registered and what signature
they must follow** had no ADR-level coverage.

**Extended by ADR-0054**: D5's single-config rule gains a third pattern:
the *eval bench* (e.g. `milestone-1h-*`) drives many configs, builds its
own per-config engines, and submits a sentinel tensor to satisfy D4.

## First action

When `kernbench.benches` is imported, `__init__.py` immediately calls
`_eager_import_and_audit(__path__, __name__)`. Its first action is to
enumerate every sibling module in the package directory via
`pkgutil.iter_modules(__path__)` and **eagerly import** each one via
`importlib.import_module(...)`, except modules matching either:

- name `registry` (the infrastructure module itself), or
- name starting with `_` (helper modules).

At import time, each `@bench(name=..., description=...)` decorator inside
the imported module runs, appending `(name, description, fn)` to
`_PENDING` and adding `fn.__module__` to `_REGISTERED_MODULES`.

Once imports finish, `_audit_modules(imported, _REGISTERED_MODULES)`
runs; if any imported module did not invoke `@bench` at least once, it
raises `RuntimeError("Bench module(s) missing @bench decorator: ...")`
immediately. At this point indices are still unassigned; the first call
to `list_all()` / `resolve(...)` triggers `_finalize()`, which sorts
`_PENDING` alphabetically by name and assigns 1-based indices.

In short, **the bench infrastructure's first act is "eagerly import
every non-helper module in the package and audit that each one
registered at least one bench"**.

## Context

`src/kernbench/benches/` currently holds 8 bench modules (`ccl_allreduce`,
`gemm_single_pe`, `gpt3_qkv`, `ipcq_allreduce`, `matmul_composite`,
`qkv_gemm`, `qkv_gemm_multi_pe`, `va_offset_verify`). Every bench follows
the same unified flow:

```
kernbench run --topology <T> --bench <N>
↓
cli/main.py::cmd_run
↓ resolve_topology(T) + resolve(N) + resolve_device(device_arg)
↓
runtime_api/bench_runner.py::run_bench(topology, bench_fn, device, engine_factory)
↓ engine_factory(topology, device) → GraphEngine
↓ RuntimeContext(engine, target_device, correlation_id, spec)
↓
bench_fn(ctx) ← invokes the bench's run(torch)
↓ ctx.empty/zeros/from_numpy/launch/distributed.* etc. submit work
↓
ctx.wait_all() ← drains any outstanding handles
↓
BenchResult(completion, correlation_id, trace, traces, engine)
```

ADR-0010 covers only the CLI surface (`run/list/probe/web`); ADR-0007
covers only the runtime API ↔ sim_engine boundary. The question "what
shape must a new bench file take?" had to be answered by grepping the
codebase. As a result:

- The @bench decorator contract (kebab-case name, non-empty description)
  lived only in the source.
- The bench function signature (`def run(torch)`) was a de-facto
  convention enforced by the CLI dispatcher calling `spec.run`.
- New bench authors learned the "helpers must use `_` prefix" rule only
  after seeing the audit's RuntimeError.
- The single-device convention (CLAUDE.md Part 2 CLI Semantics) and its
  interaction with multi-SIP CCL benches was ambiguous for bench
  authors.

This ADR consolidates all of it in one place.

## Decision

### D1. @bench decorator contract

```python
from kernbench.benches.registry import bench


@bench(name="my-bench", description="Short, complete-sentence description.")
def run(torch):
    ...
```

- `name`: kebab-case string matching `^[a-z][a-z0-9]*(-[a-z0-9]+)*$`.
  Lowercase letters, digits, and dashes only; underscores forbidden;
  must start with a letter.
- `description`: non-empty string (stripped length > 0). Displayed
  verbatim by `kernbench list`.
- The decorator **returns the function unchanged**, so direct invocation
  is fine. Its only side effect is appending to `_PENDING`.

Violations of the first two rules raise `ValueError` at decoration time.
Duplicate names are caught at `_finalize()` with
`RuntimeError("duplicate bench name: ...")`.

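
The two validation rules can be checked with a small stand-alone function. The regular expression is the one quoted in D1; the function name `validate_bench_args` and the error messages are hypothetical.

```python
import re

# Pattern quoted from D1: kebab-case, starting with a lowercase letter.
_NAME_RE = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")


def validate_bench_args(name, description):
    """Raise ValueError when name or description breaks the D1 rules."""
    if not isinstance(name, str) or not _NAME_RE.match(name):
        raise ValueError(f"bench name must be kebab-case: {name!r}")
    if not isinstance(description, str) or not description.strip():
        raise ValueError("bench description must be a non-empty string")


validate_bench_args("my-bench", "Short, complete-sentence description.")
```

The pattern rejects underscores, uppercase letters, a leading digit, doubled dashes, and a trailing dash, so `my_bench`, `My-Bench`, `1h-bench`, `my--bench`, and `my-bench-` all fail.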
### D2. Module-file convention

Every `src/kernbench/benches/<slug>.py` must be one of:

- **A bench module**: at top-level import, `@bench(...)` runs at least
  once to register at least one bench.
- **A helper module**: the filename starts with `_` (e.g.,
  `_shared_helpers.py`). The eager import skips it.

The audit (`_audit_modules`) rejects any non-helper that fails to call
`@bench`. Intended consequence: dropping a new file into `benches/`
automatically registers its benches, and helper modules are clearly
flagged by their filename prefix alone.

### D3. The bench function signature is `def run(torch)`

The decorator does not enforce a function name, but **CLI dispatch calls
`spec_entry.run`** (the decorated callable). The convention is therefore:

- Function name: `run`. Other names work, but always use `run` for
  readability and grep-ability.
- Argument: a single positional `torch`. In practice this is a
  `RuntimeContext` instance exposing PyTorch-style namespaces
  (zeros/empty/launch/distributed/...); see ADR-0024 D3.
- Return value: any (`Any`). `run_bench` ignores it and tracks
  completion via `ctx.handles()` / `engine.get_completion()`.

The `torch` name imitates a PyTorch-compatible idiom; the actual PyTorch
module is not passed in (aligned with ADR-0024's "rank = SIP" launcher
convention).

### D4. A bench must submit at least once
|
||||
|
||||
If `ctx.handles()` is empty after the bench returns, `run_bench` reports a
`BenchResult.completion` with `ok=False` and `error_code="NO_REQUESTS"`. A
meaningful bench must therefore invoke at least one of:
|
||||
|
||||
- Tensor-creation APIs: `torch.zeros(...)`, `torch.empty(...)` — these
|
||||
internally submit `MmuMapMsg` and (for zeros) `MemoryWriteMsg`.
|
||||
- Kernel-launch API: `torch.launch(name, fn, *args)` — submits per-SIP
|
||||
`KernelLaunchMsg`.
|
||||
- (Exception) Empty placeholder benches, such as the `print(...)`-only
  stub in `ipcq_allreduce.py`, submit nothing and therefore receive a
  NO_REQUESTS result. CI is expected to recognize and handle placeholder
  benches specially.
|
||||
|
||||
### D5. Single-device convention + multi-SIP exception (ADR-0024/0027)
|
||||
|
||||
CLAUDE.md Part 2 CLI Semantics' **"benchmarks MUST remain
|
||||
single-device"** rule is interpreted as follows:
|
||||
|
||||
- **Standard bench (single-SIP use)**: define tensor placement with
|
||||
`dp = DPPolicy(...)` and launch with `torch.launch(...)`. The SIP
|
||||
index is chosen by `--device` (CLI's responsibility).
|
||||
- **CCL bench (multi-SIP use)**: as an exception, use
|
||||
`torch.distributed.init_process_group(backend="ahbm")` plus
|
||||
`torch.multiprocessing.spawn(_worker, ..., nprocs=ws)` for the
|
||||
rank = SIP pattern (ADR-0024 D3). `--device` is ignored (or treated
|
||||
as `all`); each spawned worker calls `torch.ahbm.set_device(rank)` to
|
||||
bind to its SIP.
|
||||
|
||||
Multi-device patterns outside these two (e.g., one bench function
|
||||
launching across multiple SIPs in the same process) are forbidden by
|
||||
this ADR. Even with `--device all`, the bench runs once; multi-SIP use
|
||||
inside that single run must follow D5's second pattern.
|
||||
|
||||
### D6. Name/index resolution (`resolve`)
|
||||
|
||||
`resolve(identifier: str)` returns a BenchSpec via:
|
||||
|
||||
1. If `identifier.isdigit()`: convert to int and find the spec where
|
||||
`index ==` that value. If none, `ValueError("No bench with index
|
||||
...")`.
|
||||
2. If `identifier in _REGISTRY`: direct lookup.
|
||||
3. Otherwise: `ValueError("Unknown bench ...")`.
|
||||
|
||||
Empty or whitespace-only identifiers raise `ValueError("bench
|
||||
identifier must be a non-empty string.")`.
|
||||
|
||||
The CLI passes `--bench` directly to `resolve`, so users can use either
|
||||
`kernbench run --bench gemm-single-pe` or `kernbench run --bench 2`.
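
The lookup order can be sketched as follows. This is an illustration under stated assumptions: `BenchSpec` is reduced to two fields, the registry contents are invented for the example, and the tails of the error messages (elided in the text above) are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchSpec:  # simplified stand-in; the real BenchSpec has more fields
    name: str
    index: int


# Hypothetical registry contents, for illustration only.
_REGISTRY = {
    "gemm-single-pe": BenchSpec("gemm-single-pe", 1),
    "vector-add": BenchSpec("vector-add", 2),
}


def resolve(identifier: str) -> BenchSpec:
    # Empty or whitespace-only identifiers are rejected first.
    if not identifier or not identifier.strip():
        raise ValueError("bench identifier must be a non-empty string.")
    # Step 1: all-digit identifiers are treated as 1-based indices.
    if identifier.isdigit():
        index = int(identifier)
        for spec in _REGISTRY.values():
            if spec.index == index:
                return spec
        raise ValueError(f"No bench with index {index}")
    # Step 2: direct name lookup.
    if identifier in _REGISTRY:
        return _REGISTRY[identifier]
    # Step 3: nothing matched.
    raise ValueError(f"Unknown bench {identifier!r}")
```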
|
||||
|
||||
### D7. Indices are not a stable API
|
||||
|
||||
`_finalize()` sorts `_PENDING` alphabetically by name and assigns
|
||||
1-based indices. Adding a new bench can shift existing benches'
|
||||
indices. Therefore:
|
||||
|
||||
- Human-interactive use: indices are fine.
|
||||
- Scripts / CI automation: always use the name.
|
||||
|
||||
This caveat is documented in `registry.py`'s module docstring.
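
A small sketch of why indices shift, assuming nothing beyond the alphabetical sort and 1-based numbering described above (`assign_indices` is a hypothetical helper, and the bench names other than `gemm-single-pe` are invented):

```python
def assign_indices(names):
    """Sort names alphabetically and assign 1-based indices, as D7 describes."""
    return {name: i for i, name in enumerate(sorted(names), start=1)}


before = assign_indices(["gemm-single-pe", "vector-add"])
after = assign_indices(["gemm-single-pe", "vector-add", "allreduce-ring"])
```

Adding `allreduce-ring` moves `gemm-single-pe` from index 1 to index 2, which is why scripts should refer to benches by name.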
|
||||
|
||||
### D8. Surface RuntimeContext exposes to benches
|
||||
|
||||
A bench's `torch` parameter may legitimately use:
|
||||
|
||||
- **Tensor creation**: `torch.empty(shape, dtype=..., dp=DPPolicy(...),
|
||||
name=...)`, `torch.zeros(...)`, `torch.from_numpy(arr)`. All submit
|
||||
host-side metadata plus device deployment (`MmuMapMsg` +
|
||||
`MemoryWriteMsg`).
|
||||
- **Kernel launch**: `torch.launch(kernel_name, kernel_fn, *args)` —
|
||||
converts `(Tensor, int, float)` positional args to `TensorArg` /
|
||||
`ScalarArg`, submits per-SIP `KernelLaunchMsg`, and drains.
|
||||
- **Synchronization**: `torch.wait(handle)`, `torch.wait_all()`
|
||||
(`run_bench` calls the latter automatically).
|
||||
- **Distributed**: `torch.distributed.init_process_group(backend="ahbm")`,
|
||||
`torch.distributed.get_world_size()`,
|
||||
`torch.distributed.all_reduce(t, op=...)` (ADR-0024/0027).
|
||||
- **Multi-process (rank = SIP)**:
|
||||
`torch.multiprocessing.spawn(_worker, ..., nprocs=ws)` (ADR-0024 D3 /
|
||||
ADR-0027).
|
||||
- **Device binding**: `torch.ahbm.set_device(rank)` or
|
||||
`torch.accelerator.set_device_index(rank)` (both point to the same
|
||||
namespace).
|
||||
- **IPCQ install**: `torch.install_ipcq(algorithm=..., ccl_yaml=...)`
|
||||
(ADR-0023 D10).
|
||||
- **Spec lookup**: `torch.spec` — the dict produced by the topology
|
||||
builder (system / cube_mesh / HBM parameters etc.). Use it so the
|
||||
bench does not hardcode topology.yaml values.
|
||||
|
||||
Benches must not access RuntimeContext private members (`_handles`,
|
||||
`_traces`, `_allocators`, etc.) directly. This aligns with ADR-0007's
|
||||
layer-boundary spirit: bench → runtime API → sim_engine flows in one
|
||||
direction.
|
||||
|
||||
### D9. Environment-variable parameterization is allowed
|
||||
|
||||
Benches may parameterize themselves via `os.environ.get(...)`, as
|
||||
`matmul_composite.py` does for `MATMUL_M`, `MATMUL_K`, `MATMUL_N`,
|
||||
`MATMUL_DTYPE`, `MATMUL_VARIANT`. Rationale:
|
||||
|
||||
- The bench function signature is fixed by D3 to `def run(torch)`, so
|
||||
positional/keyword arguments cannot carry parameters.
|
||||
- The env-var pattern is a natural hook for operational sweeps (e.g.,
|
||||
`MATMUL_VARIANT`).
|
||||
- External drivers such as `scripts/gemm_sweep.py` (ADR-0044) consume
|
||||
this hook (it sets `MATMUL_M/K/N/VARIANT` at
|
||||
`scripts/gemm_sweep.py:115-118`).
|
||||
|
||||
When environment variables alter bench behavior, the module docstring
|
||||
must list every variable used (`matmul_composite.py` is the canonical
|
||||
example).
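
The env-var hook might look like the sketch below. The variable names come from D9; the helper name `read_matmul_params` and every fallback value are assumptions made for this example and are not the defaults used by `matmul_composite.py`.

```python
import os


def read_matmul_params(environ=os.environ):
    """Read the MATMUL_* variables named in D9.

    The fallback values are placeholders chosen for this sketch only.
    """
    return {
        "m": int(environ.get("MATMUL_M", "128")),
        "k": int(environ.get("MATMUL_K", "128")),
        "n": int(environ.get("MATMUL_N", "128")),
        "dtype": environ.get("MATMUL_DTYPE", "f16"),
        "variant": environ.get("MATMUL_VARIANT", "default"),
    }
```

Passing the mapping as a parameter keeps the function testable without mutating the process environment; a bench would call it with no arguments inside `run(torch)`.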
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. An explicit manifest file (YAML) listing benches
|
||||
|
||||
Rejected. The `@bench` + audit pattern guarantees "drop in file →
auto-register", concentrating cognitive cost in one place (the file
itself). A separate manifest is prone to drift, and helper separation is
already clear via the `_` prefix.
|
||||
|
||||
### A2. Allowing the bench's entry-point name in the decorator (`@bench(name=..., entry="run_xxx")`)
|
||||
|
||||
Rejected. Breaks the simplicity of dispatch (`spec.run` is a single
|
||||
callable). The `run` convention is sufficient; variants can register
|
||||
multiple `@bench`-decorated functions in the same module.
|
||||
|
||||
### A3. A separate `@multi_device_bench` decorator for CCL
|
||||
|
||||
Rejected. The two patterns named in D5 (single + ADR-0024 multi-SIP)
|
||||
cover all 8 current benches. A separate decorator would force dispatch
|
||||
to branch and add complexity; the multi-SIP intent is already obvious
|
||||
from the bench's `init_process_group(...)` call.
|
||||
|
||||
### A4. Make indices a stable API (registration order or explicit `index=` argument)
|
||||
|
||||
Rejected. D7's trade-off favors user-friendliness — alphabetically
|
||||
sorted 1-based indices read naturally in the `list` output. Scripts can
|
||||
use names.
|
||||
|
||||
## Consequences
|
||||
|
||||
- "How to add a bench" is consolidated in one ADR — new authors only
|
||||
need to read D1-D3 and D8 without grepping source.
|
||||
- The `_`-prefixed helper-module pattern is legitimized at ADR level,
|
||||
so future `benches/_*.py` shared helpers are free to be added.
|
||||
- The CLI's single-device convention and CCL's multi-SIP exception are
|
||||
shown to be consistent (D5) — they are orthogonal.
|
||||
- The rationale for ADR-0044's GEMM eval harness using env-var hooks
|
||||
(D9) is now ADR-pinned.
|
||||
- Indices are explicitly unstable (D7), so any CI code calling
|
||||
`kernbench run --bench 3` is flagged for review after this ADR is
|
||||
accepted.
|
||||
@@ -0,0 +1,327 @@
|
||||
# ADR-0046: TLContext — Kernel-side `tl.*` API Contract
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-22).
|
||||
|
||||
Documents the set of `tl.*` primitives exposed by
|
||||
`src/kernbench/triton_emu/`'s `TLContext`, their semantics, and the two
|
||||
execution-mode contracts (command-list / greenlet runner). ADR-0014/0020
|
||||
defines the PE pipeline and the 2-pass execution model, but **the `tl.*`
|
||||
surface that bench kernel functions call** had no ADR-level coverage.
|
||||
|
||||
## First action
|
||||
|
||||
When `TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id,
|
||||
num_cubes, scratch_base, scratch_size)` is instantiated, the first action
|
||||
is to initialize six categories of state:
|
||||
|
||||
- `self._pe_id`, `self._num_programs`, `self._cube_id`, `self._num_cubes` —
|
||||
values that `tl.program_id` / `tl.num_programs` will return.
|
||||
- `self._dispatch_cycles`: cycle count emitted as `PeCpuOverheadCmd(cycles)`
  at the start of every non-metadata `tl.*` API call (see D6).
|
||||
- `self._runner` — `KernelRunner` instance (present → greenlet mode;
|
||||
absent → command-list mode).
|
||||
- `self._commands: list[PeCommand] = []` — command-list accumulator
|
||||
(command-list mode only).
|
||||
- `self._handle_counter = 0`, `self._completion_counter = 0` — counters
|
||||
for generating TensorHandle / CompletionHandle ids.
|
||||
- `self._scratch_base`, `self._scratch_size`, `self._scratch_cursor = 0` —
|
||||
PE-local scratch region (used for math/dot/composite output handle
|
||||
addresses).
|
||||
|
||||
In short, **TLContext's first act is "record where (sip/cube/pe) and at
|
||||
what scale (num_programs/num_cubes) this kernel instance runs, and pick
|
||||
its dispatch mode (runner present or not)"**. No SimPy event is created
|
||||
and no command is emitted at this moment.
|
||||
|
||||
The runtime first action happens when the kernel function first calls a
`tl.<api>()`. The standard entry for every non-metadata `tl.*` API (D6) is:
|
||||
|
||||
1. Call `self._emit_dispatch_overhead()` — if `dispatch_cycles > 0`,
|
||||
immediately `_emit` a `PeCpuOverheadCmd(dispatch_cycles)`.
|
||||
2. Per-API processing (TensorHandle creation, command construction).
|
||||
3. `self._emit(cmd)` — in runner mode this `greenlet.switch()`es the cmd
|
||||
to SimPy; in command-list mode it appends to `self._commands`.
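
The three-step entry can be sketched in command-list mode as follows. This is a toy model, not the real `TLContext`: the command classes are simplified stand-ins, and `load(ptr, nbytes)` is a hypothetical signature used only to show the ordering.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PeCpuOverheadCmd:  # simplified stand-in for the real command type
    cycles: int


@dataclass(frozen=True)
class DmaReadCmd:  # simplified stand-in; the real command carries more fields
    ptr: int
    nbytes: int


@dataclass
class ToyTLContext:
    """Command-list-mode toy: there is no runner, so _emit only appends."""

    dispatch_cycles: int = 1
    commands: list = field(default_factory=list)

    def _emit(self, cmd):
        self.commands.append(cmd)

    def _emit_dispatch_overhead(self):
        if self.dispatch_cycles > 0:
            self._emit(PeCpuOverheadCmd(self.dispatch_cycles))

    def load(self, ptr, nbytes):
        self._emit_dispatch_overhead()  # step 1: dispatch overhead
        cmd = DmaReadCmd(ptr, nbytes)   # step 2: per-API command construction
        self._emit(cmd)                 # step 3: hand the command over


tl = ToyTLContext(dispatch_cycles=1)
tl.load(0x1000, 64)
```

After the call, `tl.commands` holds the overhead command followed by the DMA read; with `dispatch_cycles=0` only the DMA read is recorded.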
|
||||
|
||||
## Context
|
||||
|
||||
The `tl.*` surface consists of `TLContext`'s methods, and the `tl`
|
||||
parameter received by a kernel function is one of these objects. The
|
||||
contract the user (bench author) sees:
|
||||
|
||||
- Which primitives exist.
|
||||
- What data flow each primitive triggers (DMA / compute / IPCQ /
|
||||
metadata-only).
|
||||
- How a TensorHandle's `space` and `addr` are decided.
|
||||
- The difference between command-list and greenlet modes.
|
||||
|
||||
ADR-0014 (PE pipeline) defines the PeCommands consumed by PE_SCHEDULER,
|
||||
but how `tl.*` emits them is a code-only convention. ADR-0020 (2-pass
|
||||
data execution) mentions greenlet mode in D3 but does not pin down the
|
||||
signature difference (return-value handling) between the runner /
|
||||
non-runner paths. This ADR fills the gap.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. The `tl` parameter is a `TLContext` instance
|
||||
|
||||
A bench kernel function has the signature:
|
||||
|
||||
```python
# Any number of positional tensor/scalar args may precede `tl`.
def _kernel(arg1, arg2, tl, **kwargs):
    ...
```
|
||||
|
||||
`tl` is a `kernbench.triton_emu.tl_context.TLContext` instance. The name
|
||||
imitates real Triton's `triton.language` module; the actual Triton
|
||||
module is **not** passed in.
|
||||
|
||||
The kernel is plain Python — no `yield` or `async`. `tl.*` calls produce
|
||||
SimPy events, but to the caller they appear synchronous because in
|
||||
greenlet mode the KernelRunner relays between SimPy and the kernel
|
||||
(ADR-0020 D3).
|
||||
|
||||
### D2. Two execution modes — command-list / greenlet runner
|
||||
|
||||
- **Command-list mode (`runner is None`)**: `tl.*` calls append PeCommand
|
||||
to `self._commands`. DMA / GEMM / Math consume no SimPy time and return
|
||||
metadata-only TensorHandles (`data=None`). PE_SCHEDULER / sim_engine
|
||||
later replays the command sequence in time.
|
||||
|
||||
- **Greenlet runner mode (`runner is not None`)**: `tl.*` calls
|
||||
`self._emit(cmd)` → `runner.switch_to_simpy(cmd)`, handing control to
|
||||
the parent greenlet (SimPy). The parent distributes the cmd to
|
||||
components, consumes SimPy time, and (for DMA reads) returns real numpy
|
||||
data. The kernel receives the result and continues to the next line
|
||||
(the data-aware execution model from ADR-0020 D3).
|
||||
|
||||
The choice of mode is decided by whether a KernelRunner is injected into
|
||||
the TLContext. The `tl.*` methods themselves are mode-blind — they go
|
||||
through `_emit()` uniformly.
|
||||
|
||||
### D3. Primitive categories
|
||||
|
||||
#### D3.1. Reference (no DMA, metadata only)
|
||||
|
||||
- `tl.ref(ptr, shape, dtype="f16") -> TensorHandle`: create a handle
|
||||
referencing HBM data without issuing DMA. Used when the scheduler
|
||||
streams the data per-tile (e.g., the b operand of a composite GEMM).
|
||||
|
||||
#### D3.2. Data movement (blocking, DMA engine)
|
||||
|
||||
- `tl.load(ptr, shape, dtype="f16") -> TensorHandle`: HBM → handle.
|
||||
Emits `DmaReadCmd`. In greenlet mode the returned handle's `.data`
|
||||
carries real numpy data; in command-list mode it is a placeholder.
|
||||
The handle has `space="hbm"`, `pinned=True`.
|
||||
- `tl.store(ptr, handle) -> None`: TCM → HBM. Emits `DmaWriteCmd`. In
|
||||
greenlet mode, when `handle.data` is present, `_store.write("hbm",
|
||||
ptr, data)` runs first (visibility = issue time, ADR-0020 D3).
|
||||
|
||||
#### D3.3. GEMM / compute (blocking)
|
||||
|
||||
- `tl.dot(a, b) -> TensorHandle`: `a @ b`. Both operands must live in
|
||||
TCM; shapes `(M,K) × (K,N) → (M,N)`. Emits `GemmCmd`; the output
|
||||
handle is allocated from PE-local scratch via
|
||||
`_make_compute_out(shape, dtype)`.
|
||||
- `tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None,
|
||||
acc_dtype=None, tile_shape=None) -> CompletionHandle`: non-blocking
|
||||
tiled pipeline. Emits `CompositeCmd`. `epilogue` is a list of dicts,
|
||||
each with `"op"` plus op-specific fields and an optional `"scope"`
|
||||
(k_tile / output_tile). Unknown ops or missing fields raise
|
||||
ValueError immediately. The returned CompletionHandle synchronizes
|
||||
via `tl.wait(h)`.
|
||||
|
||||
#### D3.4. Math: unary (blocking)
|
||||
|
||||
- `tl.exp(x)`, `tl.log(x)`, `tl.sqrt(x)`, `tl.abs(x)`, `tl.sigmoid(x)`,
  `tl.cos(x)`, `tl.sin(x)`: each emits
  `MathCmd(op=<name>, inputs=(x,), out=<out>)`. `out` is scratch-allocated
  with the same shape/dtype as `x`.
|
||||
|
||||
#### D3.5. Math: binary (blocking)
|
||||
|
||||
- `tl.maximum(a, b)`, `tl.minimum(a, b)` — `_binary_math`.
|
||||
- `tl.fma(a, b, c)` — `a*b + c`. Three inputs.
|
||||
- `tl.clamp(x, min, max)` — `MathCmd(op="clamp", inputs=(x, min, max))`.
|
||||
- `tl.where(cond, a, b)` — `MathCmd(op="where", inputs=(cond, a, b))`.
|
||||
- `tl.softmax(x, axis=-1)` — a single `MathCmd(op="softmax")` so timing
|
||||
accounts at one dispatch. Phase 2 DataExecutor expands it to the
|
||||
canonical (x-max → exp → sum → div) sequence.
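
For reference, the canonical sequence named above corresponds to the standard numerically stable softmax. The sketch below works on a plain Python list and is independent of DataExecutor, whose actual implementation is not shown here.

```python
import math


def softmax_1d(xs):
    """Numerically stable softmax: subtract max, exponentiate, sum, divide."""
    m = max(xs)                           # x - max
    exps = [math.exp(x - m) for x in xs]  # exp
    total = sum(exps)                     # sum
    return [e / total for e in exps]      # div
```

Subtracting the maximum first keeps `math.exp` from overflowing on large inputs without changing the result.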
|
||||
|
||||
#### D3.6. Reduction (blocking)
|
||||
|
||||
- `tl.sum(x, axis)`, `tl.max(x, axis)`, `tl.min(x, axis)`: return an
  output handle with the reduced axis collapsed to size 1. Each emits
  `MathCmd(op=<name>, inputs=(x,), out=<out>, axis=axis)`.
|
||||
|
||||
#### D3.7. Index / scalar (PE_CPU, no engine)
|
||||
|
||||
- `tl.program_id(axis=0) -> int`: `axis==0` → pe_id (cube-local PE
|
||||
index), `axis==1` → cube_id (ADR-0022).
|
||||
- `tl.num_programs(axis=0) -> int`: `axis==0` → num_programs (PEs per
|
||||
cube), `axis==1` → num_cubes.
|
||||
- `tl.arange(start, end, dtype="i32") -> TensorHandle`: an index range
|
||||
in TCM. No command emitted.
|
||||
- `tl.zeros(shape, dtype="f16") -> TensorHandle`, `tl.full(shape,
|
||||
value, dtype="f16") -> TensorHandle`: TCM placeholder. No command
|
||||
emitted.
|
||||
|
||||
#### D3.8. Scalar helpers (no command, no engine)
|
||||
|
||||
- `TLContext.cdiv(a, b) -> int` (static): ceiling division
|
||||
`-(-a // b)`. Mirrors real Triton's `tl.cdiv`.
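
The expression works because Python's `//` rounds toward negative infinity, so negating before and after the division rounds up instead. A standalone sketch:

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division via floor division of the negated numerator."""
    return -(-a // b)
```

A typical use is computing a tile count, for example `cdiv(M, BLOCK_M)` for a hypothetical block size `BLOCK_M`.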
|
||||
|
||||
#### D3.9. Metadata-only (no compute, no DMA)
|
||||
|
||||
- `tl.trans(x) -> TensorHandle`: a new handle with the last two dims
|
||||
swapped. Shares `addr` and `data`; no command emitted.
|
||||
|
||||
#### D3.10. IPCQ (CCL) primitives (ADR-0023 D4)
|
||||
|
||||
- `tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None,
|
||||
dtype="f16", space="tcm") -> None`: blocking send. Accepts either
|
||||
handle form or raw-address form. Emits `IpcqSendCmd`. The handle's
|
||||
`.data` snapshot rides along on the command — avoiding the race
|
||||
where a later inbound IPCQ overwrites the slot before the outbound
|
||||
PE_DMA reads it.
|
||||
- `tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None,
|
||||
dst_space=None) -> TensorHandle`: blocking recv. Providing both
|
||||
`dst_addr` and `dst_space` enters "copy_to_dst" mode; otherwise
|
||||
"return_slot" mode. In greenlet mode the handle's `.data` carries
|
||||
the real data.
|
||||
- `tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle`:
|
||||
**DIAGNOSTIC ONLY**. Has the same blocking-arrival semantics as
|
||||
`tl.recv` but skips the slot-read latency charge (slot-IO + PE↔bank
|
||||
fabric drain). Used in the pe2pe overview plot for an apples-to-apples
|
||||
comparison against `tl.store`. Production kernels MUST NOT use it —
|
||||
the diagnostic flag is isolated in its own command branch
|
||||
(`consume=False`) so it cannot be accidentally enabled.
|
||||
- `tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture`: non-blocking
|
||||
recv. Returns a `RecvFuture`; resolved later by `tl.wait(future)`.
|
||||
|
||||
#### D3.11. Composite + control
|
||||
|
||||
- `tl.composite(...)`: see D3.3.
|
||||
- `tl.wait(handle=None)`: wait on a `CompletionHandle` (composite), a
|
||||
`RecvFuture` (async recv), or `None` (all pending composites).
|
||||
- `tl.cycles(n)`: declare a scalar PE_CPU overhead. Emits
|
||||
`PeCpuOverheadCmd(cycles=n)`.
|
||||
|
||||
### D4. TensorHandle arithmetic operators — thread-local TLContext
|
||||
|
||||
At module load, `tl_context.py::_enable_tensor_ops()` runs and patches
|
||||
`TensorHandle.__add__`, `__sub__`, `__mul__`, `__truediv__`. Each
|
||||
operator calls `_binary_math` on the active TLContext stored in a
|
||||
module-level thread-local `_ctx`.
|
||||
|
||||
So inside a kernel, `c = a + b` is equivalent to emitting
`MathCmd(op="add", inputs=(a, b), out=<out>)` and returning a new
TensorHandle.
|
||||
|
||||
Active-TLContext management:
|
||||
|
||||
- `TLContext._set_active(ctx)`: set the active ctx for the current
|
||||
thread/greenlet.
|
||||
- `TLContext._get_active()`: read it (RuntimeError if unset).
|
||||
- `run_kernel(kernel_fn, tl_ctx, *args, **kwargs)`: helper. Sets active
|
||||
on entry, runs the kernel, restores `None` on exit.
|
||||
|
||||
`KernelRunner` re-asserts `_set_active(tl)` inside its `_switch_kernel`
|
||||
just before resuming the kernel, so a sibling PE runner that overwrote
|
||||
the thread-local context is correctly recovered.
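
The pattern can be sketched with `threading.local` as below. All class names are toy stand-ins (`ToyContext`, `ToyHandle`), the recorded tuple replaces the real `MathCmd`, and the `try`/`finally` in `run_kernel` is an assumption about how the reset on exit is done.

```python
import threading

_ctx = threading.local()  # module-level slot holding the active context


def _set_active(ctx):
    _ctx.active = ctx


def _get_active():
    ctx = getattr(_ctx, "active", None)
    if ctx is None:
        raise RuntimeError("no active context")
    return ctx


class ToyHandle:
    def __init__(self, shape):
        self.shape = shape

    def __add__(self, other):
        # The operator has no ctx argument, so it looks up the active one.
        return _get_active()._binary_math("add", self, other)


class ToyContext:
    def __init__(self):
        self.commands = []

    def _binary_math(self, op, a, b):
        out = ToyHandle(a.shape)
        self.commands.append((op, a, b, out))  # stand-in for MathCmd
        return out


def run_kernel(kernel_fn, ctx, *args, **kwargs):
    _set_active(ctx)
    try:
        return kernel_fn(*args, **kwargs)
    finally:
        _set_active(None)  # restore None on exit
```

Inside `run_kernel`, `a + b` records one `"add"` entry on the context; outside it, the same expression raises `RuntimeError` because no context is active.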
|
||||
|
||||
### D5. Scratch allocator — compute output handles
|
||||
|
||||
Ops that produce a result — `tl.dot`, `tl.exp`, `tl.add` (via
|
||||
TensorHandle `__add__`), etc. — call `_make_compute_out(shape, dtype)`
|
||||
to obtain a 16-byte-aligned scratch address. The address is published
|
||||
with `space="tcm"`, so the handle can later be the source of a
|
||||
`tl.send` / `tl.store`.
|
||||
|
||||
When `_scratch_base == 0` (e.g., command-list mode), the address is 0
|
||||
and the handle cannot be a send/store source (in that case, only
|
||||
`tl.load`-returned handles are valid sources).
|
||||
|
||||
When the cursor exceeds `_scratch_size` (default 1 MiB), a
|
||||
RuntimeError is raised. The cursor must reset between kernel
|
||||
invocations (current code naturally satisfies this: KernelRunner
|
||||
creates a fresh TLContext each time).
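
A bump allocator with these properties could look like the sketch below. The class name, the error message, and the choice to align the cursor before each allocation are assumptions; only the 16-byte alignment, the 1 MiB default, and the `RuntimeError` on overflow come from the text above.

```python
class ScratchAllocator:
    """Bump-allocator sketch: 16-byte-aligned offsets inside a fixed region."""

    ALIGN = 16

    def __init__(self, base: int, size: int = 1 << 20):  # 1 MiB default
        self.base = base
        self.size = size
        self.cursor = 0

    def alloc(self, nbytes: int) -> int:
        # Round the cursor up to the next multiple of ALIGN.
        start = (self.cursor + self.ALIGN - 1) // self.ALIGN * self.ALIGN
        end = start + nbytes
        if end > self.size:
            raise RuntimeError("scratch region exhausted")
        self.cursor = end
        return self.base + start
```

Because the allocator never frees, creating a fresh instance per kernel invocation is what resets the cursor.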
|
||||
|
||||
### D6. Dispatch overhead — `PeCpuOverheadCmd(dispatch_cycles)`
|
||||
|
||||
Every non-metadata `tl.*` call starts with `_emit_dispatch_overhead()`,
|
||||
which — when `dispatch_cycles > 0` — emits
|
||||
`PeCpuOverheadCmd(dispatch_cycles)`. This models the cycles PE_CPU
|
||||
spends dispatching the command.
|
||||
|
||||
Defaults:
|
||||
|
||||
- `TLContext.__init__`'s `dispatch_cycles` parameter default: `1` cycle.
|
||||
- TLContext built by `KernelRunner`: `0` cycles (greenlet mode handles
|
||||
cycle accounting differently — aligned with ADR-0020 D3 intent).
|
||||
|
||||
### D7. Kernel registry (`triton_emu/registry.py`)
|
||||
|
||||
A separate `_kernels: dict[str, Callable]` holds the name → function
|
||||
mapping:
|
||||
|
||||
- `register_kernel(name, fn)`: ValueError on duplicate.
|
||||
- `get_kernel(name)`: KeyError if missing.
|
||||
- `clear_registry()`: test-only.
|
||||
|
||||
`RuntimeContext.launch(kernel_name, kernel_fn, *args)` overwrites
|
||||
`_kernels[kernel_name] = kernel_fn` on every call (last-call-wins,
|
||||
idempotent) — consistent with ADR-0045 D8's `launch` behavior.
|
||||
|
||||
PE_CPU looks up `KernelRef.name` in the registry and runs the function
|
||||
through KernelRunner.
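
The two registration paths differ only in how they treat an existing name. A minimal sketch, assuming a plain module-level dict (`launch_style_register` is a hypothetical name for the overwrite that `RuntimeContext.launch` performs, and the error message is a placeholder):

```python
from typing import Callable

_kernels: dict[str, Callable] = {}


def register_kernel(name: str, fn: Callable) -> None:
    if name in _kernels:
        raise ValueError(f"kernel already registered: {name}")
    _kernels[name] = fn


def get_kernel(name: str) -> Callable:
    return _kernels[name]  # raises KeyError if the name is missing


def clear_registry() -> None:
    _kernels.clear()


def launch_style_register(name: str, fn: Callable) -> None:
    """Last-call-wins overwrite, as described for RuntimeContext.launch."""
    _kernels[name] = fn
```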
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Fold `tl.*` into ADR-0014 / ADR-0020
|
||||
|
||||
Rejected. ADR-0014 covers the PE pipeline (sim_engine-side consumption
|
||||
of PeCommands); ADR-0020 covers 2-pass execution (Phase 1 timing /
|
||||
Phase 2 data). The `tl.*` surface is what the kernel author touches; a
|
||||
dedicated ADR improves findability and onboarding.
|
||||
|
||||
### A2. Deprecate command-list mode
|
||||
|
||||
Rejected (currently). Simple unit tests and kernel verification benefit
|
||||
from the lighter command-list path — it exposes a PeCommand sequence
|
||||
inspector without requiring greenlet machinery. When greenlet-mode
|
||||
semantics (real data, Phase 2) are needed, D2 explicitly selects them.
|
||||
|
||||
### A3. Remove TensorHandle arithmetic operators
|
||||
|
||||
Rejected. They mimic real Triton kernel ergonomics (e.g., `c = a + b`),
|
||||
and the thread-local active-ctx pattern works cleanly. The explicit
|
||||
function-form (`tl.add(a, b)`) is also exposed in D3.5, so the
|
||||
operators are syntactic sugar.
|
||||
|
||||
### A4. Expand softmax into the explicit sequence (max → exp → sum → div)
|
||||
|
||||
Partially adopted. `tl.softmax` is a single `MathCmd(op="softmax")` for
timing accounting (D3.5), but Phase 2 DataExecutor expands it to the
canonical sequence for real-data computation. The timing model stays
atomic while the data model is expanded; the split is intentional.
|
||||
|
||||
## Consequences
|
||||
|
||||
- Every `tl.*` primitive a bench author meets is classified and defined
|
||||
in a single ADR. Paired with ADR-0045 D8's host-side surface
|
||||
(`torch.empty` etc.), the inside-kernel and outside-kernel authoring
|
||||
guides are now complete.
|
||||
- The command-list / greenlet difference is pinned in D2, so any new
|
||||
`tl.*` primitive that follows the `_emit()` pattern auto-supports
|
||||
both modes.
|
||||
- The thread-local active-ctx pattern (D4) is justified at ADR level,
|
||||
clarifying who owns the reset responsibility when multiple PE
|
||||
runners share a thread (KernelRunner.run's contract restores active
|
||||
inside `_switch_kernel`).
|
||||
- `tl.recv_no_consume`'s diagnostic isolation (D3.10) is hardened in
|
||||
ADR form — accidental production use is blocked by a separate
|
||||
command branch.
|
||||
- The registry (D7) gets its own D-section, formalizing the
|
||||
name-collision and dynamic-re-registration semantics.
|
||||
@@ -0,0 +1,259 @@
|
||||
# ADR-0047: AHBM CCL Backend — `torch.distributed`-compat shim
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-22).
|
||||
|
||||
Pins down what `runtime_api/distributed.py`'s `AhbmCCLBackend` +
|
||||
`DistributedContext` actually install — i.e., the entry point
|
||||
`torch.distributed.init_process_group(backend="ahbm")` — and how
|
||||
`all_reduce`/`barrier`/`get_rank` etc. are implemented. ADR-0023 D11
|
||||
mentions the "torch.distributed compatibility" intent, but **the backend
|
||||
itself** had no ADR-level coverage.
|
||||
|
||||
## First action
|
||||
|
||||
`RuntimeContext.__post_init__` automatically constructs a
|
||||
`DistributedContext()` and attaches it to `self.distributed`. The first
|
||||
action at that moment:
|
||||
|
||||
1. `self._backend: AhbmCCLBackend | None = None` — uninitialized.
|
||||
2. `self._rank_by_greenlet: dict = {}` — greenlet-local rank registry
|
||||
(ADR-0024 D2).
|
||||
3. The caller (RuntimeContext) sets `dc._ctx_ref = self` so subsequent
|
||||
`init_process_group` can reach `ctx.engine` / `ctx.spec` / `ctx.launch`.
|
||||
|
||||
In short, **DistributedContext's first act is "attach to RuntimeContext
|
||||
with a back-reference and leave the backend slot empty"**. Actual
|
||||
backend installation (IPCQ install, world_size derivation, algorithm
|
||||
module import) happens only when user code calls
|
||||
`torch.distributed.init_process_group(backend="ahbm")`.
|
||||
|
||||
At that moment, `init_process_group`'s first action is:
|
||||
|
||||
1. If `backend != "ahbm"`, raise `ValueError("Unsupported backend ...")`
|
||||
immediately.
|
||||
2. If `getattr(self, "_ctx_ref", None)` is None,
|
||||
`RuntimeError("DistributedContext not bound to a RuntimeContext")`.
|
||||
3. `self._backend = AhbmCCLBackend(torch_ctx=ctx)` — inside this
|
||||
constructor, ccl.yaml is loaded, the algorithm module is imported,
|
||||
world_size is derived, SFR is configured, and IPCQ is installed.
|
||||
4. `self._backend._dist_ctx = self` — the backend gets a back-reference
|
||||
so it can read `_rank_by_greenlet`.
|
||||
|
||||
## Context
|
||||
|
||||
The `AhbmCCLBackend` exists so that PyTorch DDP collective calls
|
||||
(`init_process_group`, `all_reduce`, etc.) work unchanged and bench code
|
||||
reads identically to a real DDP training script (in line with
|
||||
ADR-0024 + ADR-0027's launcher model).
|
||||
|
||||
The backend's responsibilities:
|
||||
|
||||
- At `init_process_group` time, install the **IPCQ neighbor table once**
|
||||
(analogous to NCCL communicator creation).
|
||||
- For each `all_reduce(tensor, op="sum")`, dispatch the configured
|
||||
algorithm's kernel function via `ctx.launch(...)`.
|
||||
- Answer `get_world_size` / `get_rank` consistently from the
|
||||
greenlet-local rank registry plus ccl.yaml/topology.
|
||||
|
||||
ADR-0023 D10 (IPCQ install plan) and ADR-0024 (SIP launcher) touch
|
||||
parts of this, but **the backend's own responsibility scope and decision
|
||||
order** are not pinned anywhere. This ADR fills that gap.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. The backend is created only at `init_process_group(backend="ahbm")` time
|
||||
|
||||
`DistributedContext` starts with `_backend = None`. The backend object
|
||||
does not exist until the user calls
|
||||
`dist.init_process_group(backend="ahbm")`. Any other API
|
||||
(`is_initialized`, `get_world_size`, `all_reduce`, `barrier`) called
|
||||
while `_backend` is None raises
|
||||
`RuntimeError("Default process group has not been initialized...")` via
|
||||
the `_ensure_initialized` helper.
|
||||
|
||||
`backend != "ahbm"` raises `ValueError` immediately. Other backend names
|
||||
(`nccl`, `gloo`, etc.) are not recognized.
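
The lazy-backend rule can be sketched as follows. `ToyDistributedContext` is a stand-in that omits the `_ctx_ref` check and the real backend construction, and the error messages are shortened placeholders for the elided originals.

```python
class ToyDistributedContext:
    def __init__(self):
        self._backend = None  # no backend until init_process_group runs

    def _ensure_initialized(self):
        if self._backend is None:
            raise RuntimeError("Default process group has not been initialized")
        return self._backend

    def init_process_group(self, backend="ahbm"):
        if backend != "ahbm":
            raise ValueError(f"Unsupported backend {backend!r}")
        self._backend = object()  # placeholder for AhbmCCLBackend(...)

    def get_backend(self):
        self._ensure_initialized()
        return "ahbm"
```

Calling `get_backend()` before `init_process_group(backend="ahbm")` raises `RuntimeError`; passing `backend="nccl"` raises `ValueError` and leaves the backend slot empty.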
|
||||
|
||||
### D2. world_size resolution priority — algorithm > defaults > topology
|
||||
|
||||
`AhbmCCLBackend._resolve_world_size` (ADR-0024 D1):
|
||||
|
||||
1. If `ccl.yaml`'s algorithm entry has `world_size`, use it.
|
||||
2. Else if `defaults.world_size` is set, use it.
|
||||
3. Else fall back to `spec.system.sips.count` (the topology's SIP count).
|
||||
|
||||
The default interpretation is **rank = SIP** (ADR-0024). Cube/PE-level
|
||||
parallelism is expressed inside each rank via DPPolicy and does not
|
||||
affect world_size. An explicit `ccl.yaml` override is preserved for the
|
||||
legacy "rank = flat PE index" test path.
|
||||
|
||||
User arguments to `init_process_group(world_size=..., rank=...)` are
|
||||
**accepted but ignored** (same as real PyTorch's `RANK` / `WORLD_SIZE`
|
||||
env vars).
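
The three-level priority can be sketched as below, assuming the algorithm entry and the defaults are plain dicts and that a missing or `None` value means "not set" (the function signature is hypothetical, not that of `_resolve_world_size`):

```python
def resolve_world_size(algo_cfg: dict, defaults: dict, topology_sip_count: int) -> int:
    """Priority: algorithm entry > defaults > topology SIP count."""
    if algo_cfg.get("world_size") is not None:
        return int(algo_cfg["world_size"])
    if defaults.get("world_size") is not None:
        return int(defaults["world_size"])
    return topology_sip_count
```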
|
||||
|
||||
### D3. `init_process_group` performs five installation steps

Inside `AhbmCCLBackend.__init__`, in order:

1. **Load ccl.yaml**: `kernbench.ccl.install.load_ccl_config()` →
   `resolve_algorithm_config(_cfg_all)` produces the merged config for
   `defaults.algorithm` (or the user-specified algorithm).
2. **Import algorithm module**:
   `importlib.import_module(self._merged["module"])`. The module must
   expose a `kernel` function, a
   `kernel_args(world_size, n_elem, cube_w, cube_h)` helper, and
   optionally a `TOPO_NAME_TO_KIND` map.
3. **Resolve world_size** (D2).
4. **Collect topology metadata** from `spec`: `n_sips`, `sip_topo`
   (`ring_1d` default), `cube_w`/`cube_h`, `sips.w`/`sips.h`. When the
   SIP topology is not `ring_1d`, derive `_sip_topo_w/h` from explicit
   `w`/`h` or via square-root (require `w*h == n_sips`). Mismatch raises
   `ValueError`.
5. **Install SFR + IPCQ**:
   `kernbench.ccl.sfr_config.configure_sfr_intercube_multisip(engine, spec, self._merged)`.
   This pushes IPCQ neighbor tables to every SIP/cube's pe0 (one-time
   setup analogous to NCCL communicator creation).

If the order changes (e.g., SFR runs before the algorithm module
loads), partial initialization can result. D3 is therefore treated as an
atomic five-step block: on failure the backend remains uninstalled.
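
The width/height derivation for non-ring SIP topologies can be sketched as follows. The function name and the rule "derive both sides from the integer square root when either is missing" are assumptions; only the `w*h == n_sips` requirement and the `ValueError` on mismatch come from the text above.

```python
import math


def derive_topo_dims(n_sips, w=None, h=None):
    """Return (w, h) for a 2D SIP topology, or raise ValueError on mismatch."""
    if w is None or h is None:
        side = math.isqrt(n_sips)  # integer square root
        w, h = side, side
    if w * h != n_sips:
        raise ValueError(f"w*h ({w}*{h}) does not equal n_sips ({n_sips})")
    return w, h
```

With 16 SIPs and no explicit sizes this yields a 4×4 layout, while 8 SIPs without explicit sizes fails because 2×2 does not cover them.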
|
||||
|
||||
### D4. Greenlet-local rank binding (ADR-0024 D2)
|
||||
|
||||
`DistributedContext._rank_by_greenlet: dict[greenlet, int]` maps spawned
|
||||
worker greenlets to their ranks. When the bench launcher (e.g.,
|
||||
`torch.multiprocessing.spawn`) spawns a worker, it registers via
|
||||
`dc._bind_rank(g, rank)`.
|
||||
|
||||
`get_rank()` looks up `getcurrent()`'s greenlet. Unregistered greenlets
fall back to 0, which preserves single-driver / test compatibility.
|
||||
|
||||
The backend reads the current greenlet's rank from
|
||||
`_dist_ctx._rank_by_greenlet` during `all_reduce` (D5).
|
||||
|
||||
### D5. `all_reduce(tensor, op="sum")` behavior
|
||||
|
||||
Validation:
|
||||
|
||||
- `op != "sum"` → `NotImplementedError`. Current kernels only
|
||||
implement add reduction.
|
||||
- `tensor._handle is None` → `RuntimeError("not deployed")`.
|
||||
- `tensor._handle.shards` empty → `RuntimeError("no shards")`.
|
||||
|
||||
Preparation:
|
||||
|
||||
- `n_elem = shards[0].nbytes // tensor.itemsize` — element count of a
|
||||
single shard.
|
||||
- `kernel_fn = self._algo_module.kernel` — the algorithm module's entry
|
||||
function (imported in D3).
|
||||
- Decide effective cube dims: if the first SIP has just 1 cube, use
|
||||
`(1, 1)`; otherwise use the topology's `cube_w`/`cube_h`. This
|
||||
naturally absorbs TP runs that use only a subset of cubes.
|
||||
- `kernel_args = self._algo_module.kernel_args(world_size, n_elem,
|
||||
cube_w, cube_h)` — the algorithm decides which arguments to pass to
|
||||
its kernel.
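
The arithmetic in the preparation step can be sketched as below. The function and parameter names are hypothetical; the sketch covers only the element count and the effective cube dimensions, not the call into the algorithm module.

```python
def prepare_all_reduce(shard_nbytes, itemsize, first_sip_cube_count, cube_w, cube_h):
    """Return (n_elem, effective_cube_w, effective_cube_h)."""
    n_elem = shard_nbytes // itemsize  # element count of a single shard
    if first_sip_cube_count == 1:
        eff_w, eff_h = 1, 1            # single-cube run
    else:
        eff_w, eff_h = cube_w, cube_h  # use the topology's cube mesh
    return n_elem, eff_w, eff_h
```

For a 4096-byte shard of 2-byte elements, `n_elem` is 2048 in both cases; only the cube dimensions change with the cube count.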
|
||||
|
||||
Dispatch:
|
||||
|
||||
- Resolve the current greenlet's rank via
|
||||
`_rank_by_greenlet.get(g, 0)`.
|
||||
- Append `extra_args = (sip_rank, sip_topo_kind, sip_topo_w,
|
||||
sip_topo_h)`.
|
||||
- `pending = self.ctx.launch(algorithm_name, kernel_fn, tensor,
|
||||
*kernel_args, *extra_args, _defer_wait=True)` — `_defer_wait=True`
|
||||
delegates collective drain to the main scheduler (ADR-0027 D0.4).
|
||||
|
||||
Drain:
|
||||
|
||||
- If the parent greenlet is alive (multi-greenlet mode), enqueue
|
||||
`_pending_collective_handles` and switch to parent. The main
|
||||
scheduler drains after all ranks have launched.
|
||||
- If single-driver mode, drain inline:
|
||||
`for h, _sip_id, meta in pending: self.ctx.wait(h, _meta=meta)`.
|
||||
|
||||
### D6. `barrier()` is a no-op (single-driver model)
|
||||
|
||||
kernbench runs all ranks as greenlets inside a single Python process,
|
||||
so no cross-process synchronization is needed. `barrier()` is callable
|
||||
but does no synchronization. Kept for real-PyTorch API compatibility so
|
||||
callers don't get `NotImplementedError`.
|
||||
|
||||
If multi-process kernbench (SimPy event loop per process) is introduced
|
||||
in the future, D6 needs a superseding ADR.
|
||||
|
||||
### D7. Semantics of `get_rank` / `get_world_size` / `get_backend`
|
||||
|
||||
- `get_rank()` (D4): the current greenlet's bound rank; unregistered → 0.
|
||||
- `get_world_size()` (D2): the world_size resolved by the backend in D3.
|
||||
- `get_backend()`: always the literal string `"ahbm"`. Calling it before
  the backend exists raises the `RuntimeError` from `_ensure_initialized`.
|
||||
|
||||
Differences vs. real PyTorch:
|
||||
|
||||
- Real PyTorch `get_rank()` is a process-global value; here it is
|
||||
greenlet-local. Inside a spawned worker → the worker's rank; in the
|
||||
main thread → 0. Bench authors should expect meaningful ranks only
|
||||
inside worker functions.
|
||||
|
||||
### D8. Supported API surface (final)
|
||||
|
||||
`DistributedContext` exposes:
|
||||
|
||||
- `init_process_group(backend="ahbm", world_size=None, rank=None,
|
||||
**kwargs)`
|
||||
- `is_initialized() -> bool`
|
||||
- `get_world_size() -> int`
|
||||
- `get_rank() -> int`
|
||||
- `get_backend() -> str`
|
||||
- `all_reduce(tensor, op="sum") -> None`
|
||||
- `barrier() -> None`
|
||||
- (internal) `_bind_rank(g, rank)`
|
||||
|
||||
Other PyTorch distributed APIs (`broadcast`, `reduce`, `all_gather`,
|
||||
`gather`, `scatter`, point-to-point `send/recv`, etc.) are **not
|
||||
implemented**. Kernel-level expression is available via
|
||||
`tl.send`/`tl.recv` (ADR-0046 D3.10), but the `dist.*` surface does not
|
||||
expose them. If additional collectives are needed, add an (algorithm
module, `DistributedContext` method) pair and extend D8.
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Create the backend in `RuntimeContext.__init__`
|
||||
|
||||
Rejected. If `ccl.yaml` is missing or the algorithm module can't be
|
||||
imported, RuntimeContext construction would fail even when the bench
|
||||
does not use distributed features. Lazy creation at call time (D1) is
|
||||
the right semantics.
|
||||
|
||||
### A2. Always derive world_size from topology (no override)
|
||||
|
||||
Rejected. ADR-0024 D1's "explicit override" path is used by legacy
|
||||
tests. Diagnostic scenarios that define PE-level ranks within a single
|
||||
SIP also need this escape hatch.
|
||||
|
||||
### A3. Silent fallback for unsupported `op`
|
||||
|
||||
Rejected. If the user intends `op="prod"` / `"max"` / `"avg"` and silent
|
||||
`sum` runs instead, result validation gets very hard. Explicit
|
||||
`NotImplementedError` is safer.
|
||||
|
||||
### A4. Implement `barrier` as a SimPy event
|
||||
|
||||
Rejected (currently). With single-driver semantics there is no
|
||||
cross-process synchronization to express, so a no-op is meaningfully
|
||||
correct. A fake-barrier SimPy event would add code complexity for no
|
||||
semantic gain. Revisit when multi-process kernbench arrives.
|
||||
|
||||
## Consequences
|
||||
|
||||
- The 4-step installation (D3) for
|
||||
`torch.distributed.init_process_group(backend="ahbm")` is locked in,
|
||||
making clear where future collective algorithms must hook.
|
||||
- The priority order in D2 (algorithm > defaults > topology) makes the
|
||||
blast radius of ccl.yaml changes quickly knowable.
|
||||
- The no-op `barrier` (D6) is recorded so multi-process kernbench, if
|
||||
introduced, must explicitly supersede this ADR.
|
||||
- D8's list of unsupported APIs explicitly grounds the rejection
|
||||
message when users call, e.g., `dist.broadcast(...)`.
|
||||
@@ -0,0 +1,278 @@
|
||||
# ADR-0048: Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-22).
|
||||
|
||||
Pins down the free-list algorithm, page alignment, and coalescing rules
|
||||
used by `policy/address/allocator.py`'s `_FreeList` / `PEMemAllocator`
|
||||
and `va_allocator.py`'s `VirtualAllocator`. ADR-0001 (PhysAddr layout)
|
||||
and ADR-0011 (PA/VA/LA models) define the address schemes; the
|
||||
**allocation algorithms** had no ADR-level coverage.
|
||||
|
||||
## First action
|
||||
|
||||
### `_FreeList(capacity)`
|
||||
|
||||
On construction: `self._capacity = capacity`, `self._used = 0`,
|
||||
`self._free = [(0, capacity)]`. The first act is **establishing the
|
||||
entire region as one free block** — the tuple `(offset=0,
|
||||
size=capacity)` is the sole entry in the free list.
|
||||
|
||||
### `PEMemAllocator(sip_id, die_id, pe_id, cfg)`
|
||||
|
||||
On construction, builds two `_FreeList`s:
|
||||
|
||||
- `self._hbm = _FreeList(cfg.hbm_slice_bytes)` — the size of this PE's
|
||||
HBM slice (`hbm_bytes_per_cube // hbm_slices_per_cube`).
|
||||
- `self._tcm = _FreeList(cfg.tcm_allocatable_bytes)` — equals
|
||||
`tcm_bytes_per_pe - tcm_scheduler_reserved_bytes` (the scheduler
|
||||
reservation is pre-deducted).
|
||||
|
||||
So PEMemAllocator's first act is **constructing single-free-block
|
||||
HBM-slice and TCM regions for this PE**.
|
||||
|
||||
### `VirtualAllocator(va_base, va_size, page_size=2*1024*1024)`
|
||||
|
||||
On construction: `self._va_base = va_base`, `self._va_size = va_size`,
|
||||
`self._page_size = page_size`, `self._used = 0`, `self._free =
|
||||
[(va_base, va_size)]`. The first act is **establishing one free block of
`va_size` bytes starting at `va_base` and stashing page_size**.
|
||||
|
||||
## Context
|
||||
|
||||
`runtime_api/context.py::_ensure_allocators` builds the allocator set
|
||||
in these stages:
|
||||
|
||||
1. Read `hbm_total_gb_per_cube`, `hbm_slices_per_cube`, `tcm_size_mb`,
|
||||
per-target_device SIP range, etc. from `spec`.
|
||||
2. Pack everything into a frozen `AddressConfig`.
|
||||
3. For every combination in the target SIP range × cubes × PEs,
|
||||
construct one `PEMemAllocator(sip, cube, pe, cfg)` instance.
|
||||
4. Construct one `VirtualAllocator(va_base=0x1_0000_0000, va_size=64
|
||||
GiB, page_size=pe_mmu.page_size)`.
|
||||
|
||||
Allocator responsibilities:
|
||||
|
||||
- **PEMemAllocator**: PA-space allocation in the PE-local HBM slice /
|
||||
TCM (including PhysAddr encoding).
|
||||
- **VirtualAllocator**: device-wide VA allocation, page-aligned.
|
||||
`RuntimeContext._create_tensor` then pushes VA → PA mappings to
|
||||
components via `MmuMapMsg`.
|
||||
|
||||
These algorithms are:
|
||||
|
||||
- **First-fit**, kept simple.
|
||||
- The free-block list is **sorted by start offset**.
|
||||
- On `free()`, **adjacent blocks coalesce**.
|
||||
|
||||
The rationale was not documented anywhere, so questions such as "why
not best-fit?", "why not a buddy allocator?", and "why does
partial-overlap free pass silently?" had no anchor to answer from. This
ADR provides it.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. `_FreeList` — offset-keyed first-fit + coalescing
|
||||
|
||||
`policy/address/allocator.py::_FreeList`:
|
||||
|
||||
- Internal representation: `list[tuple[int, int]] = [(start_offset,
|
||||
size), ...]` — sorted by start offset.
|
||||
- `alloc(nbytes)`:
|
||||
1. Iterate the free list from the front (first-fit).
|
||||
2. Take from the first block with `size >= nbytes`.
|
||||
3. Exact match → drop the block; otherwise shrink it to `(start +
|
||||
nbytes, size - nbytes)`.
|
||||
4. `_used += nbytes`; return the taken `start`.
|
||||
5. If no block fits, `AllocationError("overflow ... largest free
|
||||
block ...")`.
|
||||
- `free(offset, nbytes)`:
|
||||
1. `_used -= nbytes`.
|
||||
2. `bisect_left(self._free, (offset,))` finds the insertion index.
|
||||
3. If adjacent to the previous block (`prev_start + prev_size ==
|
||||
offset`), merge.
|
||||
4. If adjacent to the next block (`offset + nbytes == next_start`),
|
||||
merge.
|
||||
5. Insert the coalesced range at the right sorted position.
|
||||
|
||||
This algorithm is weaker than best-fit / buddy on fragmentation, but
|
||||
the simulator's workload (mostly stack-like deploy/free) tolerates it.
|
||||
If the workload shape changes, D1 is a supersession candidate.
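
The `alloc` / `free` steps in D1 can be sketched as below. This is a minimal re-implementation written from the description above, not a copy of `policy/address/allocator.py`: the class name `FreeListSketch`, the attribute names, and the error-message wording are assumptions.

```python
from bisect import bisect_left


class AllocationError(Exception):
    """Raised when no free block is large enough."""


class FreeListSketch:
    """Offset-keyed first-fit free list that coalesces on free (illustrative)."""

    def __init__(self, capacity: int) -> None:
        self.used = 0
        # Sorted list of (start_offset, size); initially one block.
        self.free_blocks: list[tuple[int, int]] = [(0, capacity)]

    def alloc(self, nbytes: int) -> int:
        for i, (start, size) in enumerate(self.free_blocks):
            if size >= nbytes:  # first fit
                if size == nbytes:
                    del self.free_blocks[i]
                else:
                    self.free_blocks[i] = (start + nbytes, size - nbytes)
                self.used += nbytes
                return start
        largest = max((s for _, s in self.free_blocks), default=0)
        raise AllocationError(
            f"overflow: need {nbytes}, largest free block {largest}"
        )

    def free(self, offset: int, nbytes: int) -> None:
        # No validation of (offset, nbytes): the caller is trusted (see D2).
        self.used -= nbytes
        i = bisect_left(self.free_blocks, (offset,))
        start, size = offset, nbytes
        # Merge with the previous block if it ends exactly at `offset`.
        if i > 0 and sum(self.free_blocks[i - 1]) == offset:
            prev_start, prev_size = self.free_blocks[i - 1]
            start, size = prev_start, prev_size + size
            i -= 1
            del self.free_blocks[i]
        # Merge with the next block if it starts exactly where this range ends.
        if i < len(self.free_blocks) and self.free_blocks[i][0] == start + size:
            size += self.free_blocks[i][1]
            del self.free_blocks[i]
        self.free_blocks.insert(i, (start, size))
```

Freeing in an order other than allocation order still collapses back to a single block once everything is released, which is the property the stack-like workload relies on.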
|
||||
|
||||
### D2. Partial-overlap free is **not** validated
|
||||
|
||||
`_FreeList.free(offset, nbytes)` trusts the caller to pass the exact
|
||||
`(offset, nbytes)`. It does **not** verify:
|
||||
|
||||
- That the range was actually allocated.
|
||||
- That the range does not overlap another allocated region.
|
||||
|
||||
Reason: in a simulator context, callers always store the return value
|
||||
of `alloc()` and pass it back to `free()` — there is no external user
|
||||
input. Adding a safety check would cost O(N) per free and impact
|
||||
simulation wall-clock.
|
||||
|
||||
If this trust model breaks (e.g., a code path lets two tensors point
|
||||
at the same PA), this ADR must be revisited.
|
||||
|
||||
### D3. `PEMemAllocator` — two channels for HBM/TCM
|
||||
|
||||
`PEMemAllocator(sip_id, die_id, pe_id, cfg)` holds two `_FreeList`s:
|
||||
|
||||
- `_hbm`: size `cfg.hbm_slice_bytes`.
|
||||
- `_tcm`: size `cfg.tcm_allocatable_bytes` (= `tcm_bytes_per_pe -
|
||||
tcm_scheduler_reserved_bytes`).
|
||||
|
||||
`alloc_hbm(nbytes) -> PhysAddr`:
|
||||
|
||||
- `_hbm.alloc(nbytes)` → offset.
|
||||
- `PhysAddr.pe_hbm_addr(sip_id, die_id, pe_id,
|
||||
pe_local_hbm_offset=offset, slice_size_bytes=cfg.hbm_slice_bytes)`.
|
||||
- Failure raises `AllocationError("HBM overflow ...")`.
|
||||
|
||||
`free_hbm(pa, nbytes)`:
|
||||
|
||||
- Recover PE-local offset via `pa.hbm_offset - pe_id *
|
||||
cfg.hbm_slice_bytes`.
|
||||
- `_hbm.free(offset, nbytes)`.
|
||||
|
||||
`alloc_tcm(nbytes) -> PhysAddr`: similar; uses `PhysAddr.pe_tcm_addr`.
|
||||
|
||||
`free_tcm(pa, nbytes)`: uses `pa.sub_offset` directly (TCM's PE-local
|
||||
offset equals its sub_offset).
|
||||
|
||||
The allocator does not see the scheduler-reserved TCM region
|
||||
(`cfg.tcm_scheduler_reserved_bytes`) — it is pre-subtracted from the
|
||||
`_tcm` capacity. This is consistent with ADR-0014's PE_SCHEDULER
|
||||
internal-buffer reservation.
|
||||
|
||||
### D4. `VirtualAllocator` — page-aligned first-fit + coalescing
|
||||
|
||||
`policy/address/va_allocator.py::VirtualAllocator`:
|
||||
|
||||
- Internal representation: same sorted `list[tuple[int, int]]` as
|
||||
`_FreeList`. Initially `[(va_base, va_size)]`.
|
||||
- `_align_up(nbytes) = ceil(nbytes / page_size) * page_size`.
|
||||
- `alloc(nbytes) -> int`:
|
||||
1. `aligned = _align_up(nbytes)`.
|
||||
2. First-fit a block with `size >= aligned`.
|
||||
3. Take `aligned` from the block's front; remove if exact.
|
||||
4. `_used += aligned`. Return the block's `start` (which is page-
|
||||
aligned).
|
||||
5. Failure → `VaAllocationError`.
|
||||
- `free(va, nbytes)`: free `_align_up(nbytes)` worth. Coalesces with
|
||||
the same algorithm as `_FreeList`.
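
The page rounding used by `alloc` and `free` can be sketched with integer arithmetic. The function below is an illustration of the `_align_up` formula only; whether the real method uses floats or integer ceiling division is not stated here, so treat the body as an assumption.

```python
def align_up(nbytes: int, page_size: int) -> int:
    """Round nbytes up to a whole number of pages using integer arithmetic."""
    pages = -(-nbytes // page_size)  # ceiling division without floats
    return pages * page_size


MIB = 1024 * 1024
small = align_up(1, 2 * MIB)            # one byte still costs a full page
exact = align_up(4 * MIB, 2 * MIB)      # already a multiple of the page size
spill = align_up(4 * MIB + 1, 2 * MIB)  # one byte over spills into a new page
```

Because `free(va, nbytes)` re-derives the aligned size from `nbytes`, the caller has to pass the same `nbytes` it passed to `alloc` for the free list to get back exactly what was taken.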
|
||||
|
||||
`page_size` has different defaults in two places:
|
||||
|
||||
- `VirtualAllocator.__init__`'s parameter default: `2 MiB`. Direct-call
|
||||
tests receive this.
|
||||
- `RuntimeContext._ensure_allocators` when constructing the instance:
|
||||
`pe_mmu.attrs.get("page_size", 4096)` — uses
|
||||
`topology.yaml`'s `pe_mmu.attrs.page_size` if set, else falls back
|
||||
to `4 KiB`.
|
||||
|
||||
The two defaults differ on purpose: `VirtualAllocator`'s standalone
|
||||
default (`2 MiB`) aligns with ADR-0039's PE_MMU stopgap default for
|
||||
direct-test ergonomics; the context fallback (`4 KiB`) is the safe
|
||||
minimum when `topology.yaml` doesn't specify a page size. The
|
||||
production path is always the latter (via `_ensure_allocators`), and
|
||||
when `topology.yaml` sets `page_size`, that value flows consistently
|
||||
into both the MMU and the VA allocator.
|
||||
|
||||
If consistency breaks (e.g., VirtualAllocator instantiated with a
|
||||
page_size different from PE_MMU's), MMU `map()` falls into the
|
||||
sub-page region mode (ADR-0039 D3).
|
||||
|
||||
VA range defaults: `va_base = 0x1_0000_0000` (= 4 GiB), `va_size = 64
|
||||
GiB`. These are hardcoded in `_ensure_allocators` and have no
|
||||
semantic meaning in ADR-0011's VA model — they simply reserve enough
|
||||
device-wide space without colliding with host code.
|
||||
|
||||
### D5. Lifecycle of allocator instances
|
||||
|
||||
- `RuntimeContext._ensure_allocators` is lazy — called on the first
|
||||
`_create_tensor`.
|
||||
- The allocator dict (`self._allocators`) lives for the
|
||||
RuntimeContext's lifetime. A second deploy in the same process
|
||||
does not construct new objects.
|
||||
- `RuntimeContext.cleanup()` walks living tensors and calls
|
||||
`_free_tensor()`, which issues MMU unmaps + `va_allocator.free` +
|
||||
`pemem_allocator.free_hbm` — restoring the free lists. A subsequent
|
||||
RuntimeContext starts fresh.
|
||||
|
||||
This per-RuntimeContext isolation guarantees deterministic deploy →
|
||||
cleanup → deploy sequences within a single process.
|
||||
|
||||
### D6. Allocator failure raises (no silent OOM)
|
||||
|
||||
Both `_FreeList.alloc` and `VirtualAllocator.alloc` raise
|
||||
`AllocationError` / `VaAllocationError` when no block fits. The message
|
||||
includes "required size + largest available block" to distinguish
|
||||
fragmentation from true OOM.
|
||||
|
||||
A silent fallback (e.g., allocating only as much as the largest free
|
||||
block) is strictly forbidden — a partially-allocated tensor reaching
|
||||
SimPy would cause routing / DMA to see incorrect PAs and silently
|
||||
corrupt simulation results.
|
||||
|
||||
### D7. One allocator per address space
|
||||
|
||||
Physical address spaces are separated by PhysAddr sub-units (ADR-0001
|
||||
D2.3); each sub-unit gets its own allocator instance:
|
||||
|
||||
- HBM slice → `PEMemAllocator._hbm`.
|
||||
- PE TCM → `PEMemAllocator._tcm`.
|
||||
- (Currently unused) M_CPU local memory, CUBE SRAM → would need their
|
||||
own allocators. Today these are handled as IPCQ-only slots (ADR-0023
|
||||
D9.7) and do not share PA space, so no free-list exists for them.
|
||||
|
||||
When a cube-level SRAM allocator is needed,
|
||||
`_FreeList(cfg.sram_bytes_per_cube)` is added per-cube
|
||||
(`cfg.sram_bytes_per_cube` is already defined in `AddressConfig` —
|
||||
the data model is ready).
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Best-fit / buddy allocator
|
||||
|
||||
Rejected (currently). The workload's alloc/free pattern is stack-like
|
||||
(deploy order ≈ free order), so first-fit + coalescing controls
|
||||
fragmentation well enough. If long-running fragmentation appears in LLM
|
||||
kernel sweeps, a buddy-allocator ADR will replace D1.
|
||||
|
||||
### A2. Add partial-overlap free validation
|
||||
|
||||
Rejected. D2's trust model plus the O(N) per-free cost makes this
|
||||
unattractive. A debug mode (e.g., `KERNBENCH_DEBUG` env var) that
|
||||
enables the check could be added later.
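
For reference, the kind of O(N) check such a debug mode might run is sketched below. Nothing like this exists in the allocator today; `overlaps_free_block` is a hypothetical helper that detects one class of misuse (freeing a range that is already partly or fully free).

```python
def overlaps_free_block(
    free_blocks: list[tuple[int, int]], offset: int, nbytes: int
) -> bool:
    """Return True if [offset, offset + nbytes) intersects any free block.

    Scans every (start, size) entry, which is the O(N) per-free cost
    that D2 declines to pay on the normal path.
    """
    end = offset + nbytes
    return any(
        start < end and offset < start + size for start, size in free_blocks
    )


# Free list after allocating [10, 60) out of a 100-byte region.
free_blocks = [(0, 10), (60, 40)]
```

A range that merely touches a free block at its boundary, such as `(10, 20)` here, does not count as an overlap; that is the normal coalescing case.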
|
||||
|
||||
### A3. A unified allocator for VA and PA
|
||||
|
||||
Rejected. VA space (64 GiB device-wide) and PA space (per-slice ~6
|
||||
GiB) have different semantic dimensions — VA is the kernel's view, PA
|
||||
is the device sub-unit's view. ADR-0011's VA model (MMU maps between
|
||||
the two) calls for separated allocators.
|
||||
|
||||
### A4. Multi-tier page sizes (large pages + small pages)
|
||||
|
||||
Rejected (currently). A single page size (2 MiB) matches LLM kernel
|
||||
tensor sizes (a few MiB to GiB); smaller mappings are absorbed by
|
||||
ADR-0039 D3's sub-page region mode. Multi-tier paging would require
|
||||
extending the MMU model itself — a separate ADR candidate.
|
||||
|
||||
## Consequences
|
||||
|
||||
- The allocator algorithm is pinned at ADR level (D1, D3, D4), so any
|
||||
future simulation scenario hitting fragmentation has a clear "we're
|
||||
using first-fit + coalescing" anchor to inspect.
|
||||
- D2's trust model is explicit, so any future code path that exposes
|
||||
alloc/free to direct user input will trigger this ADR's supersession
|
||||
early.
|
||||
- D7's one-allocator-per-sub-unit mapping is on record, so when M_CPU
|
||||
or SRAM need their own free-list, the addition point is obvious.
|
||||
- D4 captures the page_size dual-default and its production path
|
||||
(`_ensure_allocators` always wins), letting future `topology.yaml`
|
||||
`page_size` changes be assessed against ADR-0039's stopgap
|
||||
interaction quickly.
|
||||
@@ -0,0 +1,247 @@
|
||||
# ADR-0049: `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-22).
|
||||
|
||||
Pins down the traffic-pattern catalog, formula-vs-actual comparison, and
|
||||
invariant checks (monotonicity, D2H ≥ H2D, etc.) exposed by
|
||||
`probes/probe.py::run_probe(...)`. ADR-0010 (CLI surface) enumerates the
|
||||
`kernbench probe` subcommand, but **what probe actually measures** and
|
||||
**which invariants it judges PASS/FAIL** had no ADR-level coverage.
|
||||
|
||||
## First action
|
||||
|
||||
`run_probe(topology_path, case_filter=None)` performs four startup steps:
|
||||
|
||||
1. `Path(topology_path).expanduser().resolve()` → absolute path.
|
||||
2. `load_topology(path)` → `TopologyGraph` (graph + spec).
|
||||
3. `_build_edge_map(graph)` → a `{(src, dst): Edge}` lookup table.
|
||||
4. Instantiate `AddressResolver(graph)` + `PathRouter(graph)`.
|
||||
|
||||
Then it sets `nbytes = 32768` (= 32 KiB, the summary-table reference
|
||||
size) and `show_all = (case_filter is None or case_filter == "all")`.
|
||||
|
||||
In short, **probe's first act is "load the topology once and prepare
|
||||
edge map / resolver / router, plus pin 32 KiB as the standard measurement
|
||||
size"**. After that, the H2D → D2H → PE DMA categories execute in
|
||||
separate `GraphEngine` instances (no cross-talk between cases).
|
||||
|
||||
## Context
|
||||
|
||||
`kernbench probe` was introduced as a verification tool for these
|
||||
purposes:
|
||||
|
||||
- **Manual ground truth**: when a real-simulation result (`kernbench run
|
||||
--bench ...`) shows abnormal latency, derive the answer for a simple
|
||||
traffic pattern in isolation and compare.
|
||||
- **Formula vs actual**: check whether the analytical model
|
||||
(wire latency + overhead + drain) matches the simulator's
|
||||
`total_ns`. A mismatch points to which simplifying assumption in
|
||||
ADR-0033 is missing.
|
||||
- **Monotonicity check**: latency should grow monotonically with hop
|
||||
count.
|
||||
- **Utilization sweep**: a BW-utilization table across data sizes
|
||||
(4 KiB ~ 1 MiB).
|
||||
|
||||
Without an ADR for this tool:
|
||||
|
||||
- Adding a new traffic-pattern category (e.g., MCpuDma, IPCQ) is hard
|
||||
because the table format / measurement units of existing categories
|
||||
aren't documented at the ADR level.
|
||||
- The basis for the monotonicity check (hop count? cube distance? wire
|
||||
length?) is ambiguous.
|
||||
- The reference size 32 KiB and the sweep `[4 KiB, 16 KiB, 64 KiB, 256
|
||||
KiB, 1 MiB]` are only discoverable by reading source.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. Three case categories — H2D / D2H / PE DMA
|
||||
|
||||
Each category has a distinct data path in the topology and gets its own
|
||||
summary table + sweep table + route-detail block.
|
||||
|
||||
- **H2D (Host → Device Write)**: `MemoryWriteMsg(dst_sip=0, dst_cube,
|
||||
dst_pe=0, pattern="zero")` flows along `pcie_ep → io_cpu → m_cpu →
|
||||
hbm_ctrl`. The cube index varies the hop count:
|
||||
- h2d-1hop: cube=0, hops=1
|
||||
- h2d-2hop: cube=4, hops=2
|
||||
- h2d-3hop: cube=8, hops=3
|
||||
- h2d-4hop: cube=12, hops=4
|
||||
- **D2H (Device → Host Read)**: `MemoryReadMsg(src_sip=0, src_cube,
|
||||
src_pe=0)`. Total latency = forward command path + reverse data path.
|
||||
  Uses the same four hop-count cases as H2D.
|
||||
- **PE DMA (PE-initiated)**: `PeDmaMsg(src_sip, src_cube, src_pe,
|
||||
dst_pa)`. Five cases cover varying cube/PE positions:
|
||||
- pe-local-hbm: same cube, same PE
|
||||
- pe-same-half-hbm: same cube, different PE (PE 1)
|
||||
- pe-cross-half-hbm: same cube, far PE (PE 4)
|
||||
- pe-cross-cube-hbm-best: adjacent cube (cube 1)
|
||||
- pe-cross-cube-hbm-worst: diagonal far cube (cube 15)
|
||||
|
||||
The cube indices 4/8/12 (H2D) and 1/4/15 (PE DMA) are meaningful for a
|
||||
4 × 4 cube mesh (`sip.cube_mesh.w=4, h=4`); changes to the mesh size
|
||||
require these to be updated in lockstep.
|
||||
|
||||
### D2. Standard measurement size — `nbytes = 32768` (32 KiB)
|
||||
|
||||
Every case in the summary table runs once with `nbytes=32768`. 32 KiB
|
||||
was chosen because:
|
||||
|
||||
- DMA overhead and BW drain are balanced — neither dominates.
|
||||
- It compares cleanly against the one-shot transfer size of several
|
||||
sub-units (TCM, register file).
|
||||
|
||||
Per-size utilization variations are shown in a separate sweep table
|
||||
(D3).
|
||||
|
||||
### D3. Utilization sweep — `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]`
|
||||
|
||||
`SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]`,
|
||||
`SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]`. Per size:
|
||||
|
||||
```
|
||||
drain = nbytes / bottleneck_bw
|
||||
total = overhead + wire + drain
|
||||
eff_bw = nbytes / total
|
||||
util% = eff_bw / bottleneck_bw × 100
|
||||
```
|
||||
|
||||
When the bottleneck bandwidth (`bn_bw`) is `None` or `<= 0`, the column shows 0.0 %. The intent: the
|
||||
table shows in one view how small transfers become overhead-bound and
|
||||
large transfers become drain-bound as hop count rises.
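
The per-size formulas in D3 can be worked through in a few lines. This is a sketch, not the code in `probes/probe.py`: `sweep_utilization` is a hypothetical name, and it assumes bandwidth is expressed in GB/s with 1 GB = 10^9 bytes, so that 1 GB/s equals 1 byte/ns and `nbytes / bw` is already in nanoseconds.

```python
SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]
SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]


def sweep_utilization(overhead_ns, wire_ns, bottleneck_bw_gbs):
    """Return [(label, util_percent)] for each sweep size."""
    rows = []
    for label, nbytes in zip(SWEEP_LABELS, SWEEP_SIZES):
        if bottleneck_bw_gbs is None or bottleneck_bw_gbs <= 0:
            rows.append((label, 0.0))  # no usable bottleneck BW
            continue
        drain_ns = nbytes / bottleneck_bw_gbs  # 1 GB/s == 1 byte/ns
        total_ns = overhead_ns + wire_ns + drain_ns
        eff_bw = nbytes / total_ns
        rows.append((label, eff_bw / bottleneck_bw_gbs * 100.0))
    return rows


# Made-up inputs for illustration: 100 ns overhead, no wire delay, 64 GB/s.
rows = sweep_utilization(overhead_ns=100.0, wire_ns=0.0, bottleneck_bw_gbs=64.0)
utils = [util for _, util in rows]
```

Algebraically `util%` reduces to `drain / total × 100`, so with fixed overhead and wire time the utilization rises with transfer size and approaches 100 % only when drain dominates.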
|
||||
|
||||
### D4. Measured columns — actual / formula / breakdown
|
||||
|
||||
Per-case columns:
|
||||
|
||||
- `Actual` (total_ns): the SimPy run's `trace["total_ns"]`.
|
||||
- `Ovhd`: sum of `node.attrs["overhead_ns"]` along the path (formula).
|
||||
- `Drain`: `nbytes / min(edge.bw_gbs over path)` (formula).
|
||||
- `Wire`: `Σ edge.distance_mm * (ns_per_mm from spec)`.
|
||||
- `Ovhd%` / `Drain%`: each portion as a percentage of Actual. Wire is
|
||||
usually too small to display.
|
||||
- `Eff.BW`: `nbytes / total_ns` (measured BW).
|
||||
- `BN.BW`: bottleneck bandwidth (formula). The minimum edge BW along
|
||||
the path. Missing edge BW shows "-".
|
||||
- `Util%`: `Eff.BW / BN.BW × 100`. 100 % means the single-stream BW
|
||||
upper bound is reached.
|
||||
|
||||
A large gap between the formula sum (`wire + ovhd + drain`) and Actual
|
||||
signals a factor the simplified model misses (a place to inspect
|
||||
ADR-0033's assumptions).
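
A sketch of how the formula columns combine for a single path follows. The function name, the input shapes, and the choice to ignore edges without a bandwidth when taking the minimum are assumptions made for illustration; the numbers are invented.

```python
def formula_breakdown(nbytes, node_overheads_ns, edges, ns_per_mm):
    """Compute the formula-side columns for one path.

    edges: list of (bw_gbs or None, distance_mm) tuples along the path.
    Bandwidth is assumed to be in GB/s, i.e. bytes per nanosecond.
    """
    ovhd = sum(node_overheads_ns)
    known_bws = [bw for bw, _ in edges if bw is not None]
    bn_bw = min(known_bws) if known_bws else None
    drain = nbytes / bn_bw if bn_bw else 0.0
    wire = sum(dist for _, dist in edges) * ns_per_mm
    return {
        "ovhd": ovhd,
        "drain": drain,
        "wire": wire,
        "bn_bw": bn_bw,
        "formula_total": ovhd + drain + wire,
    }


# Invented two-hop path: 64 GB/s then 128 GB/s, 0.5 ns per mm of wire.
cols = formula_breakdown(
    nbytes=32768,
    node_overheads_ns=[10.0, 20.0],
    edges=[(64.0, 2.0), (128.0, 3.0)],
    ns_per_mm=0.5,
)
```

Comparing `formula_total` against the simulator's `total_ns` for the same path is the formula-vs-actual check that D4 describes.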
|
||||
|
||||
### D5. Automatic invariant checks — PASS/FAIL
|
||||
|
||||
The following invariants are reported with `[v] PASS` / `[x] FAIL`:
|
||||
|
||||
- **H2D / D2H monotonic increase**: as hop count rises, actual latency
|
||||
must grow monotonically. `all(lats[i] < lats[i+1] for ...)`.
|
||||
- **D2H ≥ H2D**: for the same hop index, D2H ≥ H2D (D2H has both
|
||||
forward command and reverse data legs). `all(d2h[i].total >=
|
||||
h2d[i].total)`.
|
||||
- **PE DMA best < worst**: cross-cube best (adjacent) latency must be
|
||||
less than cross-cube worst (diagonal).
|
||||
- **PE DMA local vs remote**: prints the local BN BW vs remote BN BW
|
||||
side-by-side (informational, not PASS/FAIL).
|
||||
|
||||
When a check fails, a single clear line surfaces the regression for
|
||||
human review.
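
The two list-based invariants and the verdict prefix can be sketched as below. The helper names and the text after the `[v] PASS` / `[x] FAIL` prefix are hypothetical, and the latencies are invented; only the comparison expressions and the prefixes come from D5.

```python
def strictly_increasing(latencies):
    """H2D / D2H monotonicity: each hop count must be slower than the last."""
    return all(a < b for a, b in zip(latencies, latencies[1:]))


def d2h_not_faster(h2d_totals, d2h_totals):
    """D2H >= H2D at every hop index."""
    return all(d >= h for h, d in zip(h2d_totals, d2h_totals))


def verdict_line(label, ok):
    """Format one report line with the PASS/FAIL prefix."""
    prefix = "[v] PASS" if ok else "[x] FAIL"
    return f"{prefix} {label}"


# Invented latencies in ns for the 1..4-hop cases.
h2d = [100.0, 140.0, 180.0, 220.0]
d2h = [150.0, 210.0, 270.0, 330.0]
regressed = [100.0, 140.0, 140.0, 220.0]  # 2-hop and 3-hop tie
```

Note that the monotonicity check is strict: two equal latencies count as a failure, as in `regressed` above.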
|
||||
|
||||
### D6. Route detail — per-hop timestamp trace
|
||||
|
||||
After the summary and sweep tables, each case's path and cumulative
|
||||
per-hop timestamps (`_hop_timestamps`) appear in a separate section:
|
||||
|
||||
- H2D: leg1 (`pcie_ep → io_cpu`) + leg2 (`io_cpu → m_cpu`) + leg3
|
||||
(`m_cpu → hbm_ctrl`) + per-hop trace.
|
||||
- D2H: forward (cmd, no data) and reverse (data) traces shown
|
||||
separately.
|
||||
- PE DMA: `pe_dma → router → hbm_ctrl` path + per-hop trace.
|
||||
|
||||
Each hop's timestamp is cumulative `wire_ns + overhead_ns`. The
|
||||
terminal hop's annotation appends `drain:Xns`. Bottleneck edges are
|
||||
marked `<BN:XXGB/s>` so they are visually identifiable.
|
||||
|
||||
### D7. Semantics of the `case_filter` argument
|
||||
|
||||
- `None` or `"all"`: run all cases (default).
|
||||
- Other strings: run only the case whose name matches exactly. Example:
|
||||
`kernbench probe --case h2d-2hop`.
|
||||
|
||||
Within a category, cases with `name != case_filter` are skipped; if
|
||||
only one data point remains, the category's monotonicity / D2H ≥ H2D
|
||||
comparisons are naturally skipped.
|
||||
|
||||
The CLI parser's `--case` default is `"all"`, so omitting it runs
|
||||
everything.
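
The filter semantics amount to an exact-match selection. The sketch below uses the H2D case names from D1; `select_cases` is a hypothetical helper, not a function in `probes/probe.py`.

```python
H2D_CASES = ["h2d-1hop", "h2d-2hop", "h2d-3hop", "h2d-4hop"]


def select_cases(case_names, case_filter=None):
    """Return the cases to run for a given filter value."""
    show_all = case_filter is None or case_filter == "all"
    if show_all:
        return list(case_names)
    # Exact name match only; no prefix or glob matching.
    return [name for name in case_names if name == case_filter]


everything = select_cases(H2D_CASES)
only_two_hop = select_cases(H2D_CASES, "h2d-2hop")
other_category = select_cases(H2D_CASES, "pe-local-hbm")
```

With a single selected case there is only one data point in the category, which is why the pairwise comparisons have nothing to compare.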
|
||||
|
||||
### D8. Fresh GraphEngine per case
|
||||
|
||||
Each of the 4 H2D, 4 D2H, and 5 PE DMA cases runs in **its own
|
||||
GraphEngine** (`engine = GraphEngine(graph)`). Reasons:
|
||||
|
||||
- Isolate accumulated state (op_log, completion tracking, allocators)
|
||||
so cases do not cross-talk.
|
||||
- Guarantee one case's traffic does not perturb another case's BW
|
||||
measurement.
|
||||
|
||||
This isolation lets probe results be interpreted as **single-flow**
|
||||
per-case latency. Multi-flow contention measurement is handled by
|
||||
separate tooling (e.g., the `pe2pe_overview` plot or ADR-0033's
|
||||
multi-flow merging model).
|
||||
|
||||
### D9. Output-format stability
|
||||
|
||||
probe's stdout is meant for humans; precise column widths, separators,
|
||||
and whitespace are **not** a machine-readable contract. Automated tools
|
||||
that wish to parse probe output should use a separate JSON-output mode
|
||||
(not yet implemented).
|
||||
|
||||
The `[v]` / `[x]` prefix on PASS/FAIL lines is a stable CI grep anchor.
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Register probe as another bench (`@bench(name="probe")`)
|
||||
|
||||
Rejected. probe is a verification tool, not a bench: multi-engine
execution for sweeps/analysis and PASS/FAIL invariant output are
essential, and neither fits ADR-0045's "single device + single
RuntimeContext" bench model.
|
||||
|
||||
### A2. Exit code 1 on monotonicity violation
|
||||
|
||||
Rejected (currently). probe is positioned as a human inspection tool —
|
||||
PASS/FAIL is printed and exit is 0. A wrapper can `grep "\[x\]"` to
|
||||
decide. A future `--strict` flag could opt into non-zero exits.
|
||||
|
||||
### A3. Externalize the case catalog to YAML
|
||||
|
||||
Rejected (currently). The 13 cases (4 H2D + 4 D2H + 5 PE DMA)
|
||||
are hardcoded and their semantics are tightly bound to the mesh
|
||||
topology. Moving cube-index meaning (4, 8, 12 / 1, 4, 15) into YAML
|
||||
would require separate documentation and lose cohesion. Externalize
|
||||
only when case additions become frequent.
|
||||
|
||||
### A4. Add multi-flow contention measurement
|
||||
|
||||
Rejected (out of probe scope). D8's single-flow isolation is probe's
|
||||
core intent. Multi-flow contention belongs in a different area of the
|
||||
ADR-0033 latency model — either a separate tool or a new case
|
||||
category.
|
||||
|
||||
## Consequences
|
||||
|
||||
- probe's case catalog (D1) and measurement units (D2/D3) are pinned at
|
||||
ADR level, so new traffic categories know which table format to
|
||||
follow.
|
||||
- The semantics of the formula-vs-actual columns (D4) are locked in, so
|
||||
questions like "why is Drain% 5 % or 70 %?" can quickly be linked to
|
||||
ADR-0033 assumption checks.
|
||||
- Automatic invariant checks (D5) are pinned, so latency-model changes
|
||||
immediately catch monotonicity / D2H ≥ H2D regressions.
|
||||
- D8's case-isolation is explicit, so probe results are safe to read as
|
||||
single-flow measurements. If multi-flow is needed, a separate tool
|
||||
track is clearly required.
|
||||
- A2's strict-mode flag is recorded as a follow-up so CI integration
|
||||
has a minimal change path when requested.
|
||||
@@ -0,0 +1,322 @@
|
||||
# ADR-0050: CCL Algorithm Module Contract — `ccl/algorithms/*.py`
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-22).
|
||||
|
||||
Pins down the interface, kernel signature, and addition workflow that a
|
||||
module under `src/kernbench/ccl/algorithms/` must satisfy in order to be
|
||||
used as a collective algorithm by the AHBM CCL backend (ADR-0047).
|
||||
ADR-0047 D3 states only that "the algorithm module must expose `kernel`,
|
||||
`kernel_args`, optionally `TOPO_NAME_TO_KIND`"; **the contract an
|
||||
algorithm-module author needs to follow** has had no ADR-level coverage.
|
||||
This ADR pairs with ADR-0045's bench-module contract.
|
||||
|
||||
## First action
|
||||
|
||||
An algorithm module is imported at two moments:
|
||||
|
||||
1. **AHBM backend entry**: when user code calls
|
||||
`dist.init_process_group(backend="ahbm")`,
|
||||
`AhbmCCLBackend.__init__` runs
|
||||
`self._algo_module = importlib.import_module(self._merged["module"])`.
|
||||
At module level, the following occur first:
|
||||
- Topology-kind integer constants like `SIP_TOPO_RING/TORUS/MESH`
|
||||
are bound in the module namespace.
|
||||
- The `TOPO_NAME_TO_KIND` dict is bound; the backend reads it via
|
||||
`getattr(self._algo_module, "TOPO_NAME_TO_KIND", None)`.
|
||||
- `kernel_args` function is defined for the caller.
|
||||
- The actual algorithm function (e.g.,
|
||||
`allreduce_intercube_multidevice`) is defined.
|
||||
- At the bottom of the module, `kernel = allreduce_intercube_multidevice`
|
||||
publishes the alias.
|
||||
|
||||
2. **ccl.yaml install stage**:
|
||||
`kernbench.ccl.install.install_ipcq` imports the same algorithm
|
||||
module while pushing the IPCQ neighbor table.
|
||||
|
||||
In short, **the algorithm module's first act is "publish topology-kind
|
||||
constants, the `TOPO_NAME_TO_KIND` dict, the `kernel_args` function, and
|
||||
the `kernel` alias into the module namespace"** — all as import-time
|
||||
side effects, no separate initialization call.
|
||||
|
||||
## Context
|
||||
|
||||
`AhbmCCLBackend` (ADR-0047), at process-group creation, dynamically
|
||||
imports a module path obtained from `ccl.yaml`'s `defaults.algorithm` (or
|
||||
a user-specified algorithm). The backend expects four things from the
|
||||
module:
|
||||
|
||||
- `kernel`: the collective's entry function.
|
||||
- `kernel_args(world_size, n_elem, cube_w=, cube_h=) -> tuple`: a tuple
|
||||
packing the kernel's positional arguments.
|
||||
- `TOPO_NAME_TO_KIND` (optional): a dict mapping `topology.yaml`'s
|
||||
`sips.topology` string (e.g., `"ring_1d"`, `"torus_2d"`,
|
||||
`"mesh_2d_no_wrap"`) to the integer kind constants.
|
||||
- (Indirectly) IPCQ neighbor-table install:
|
||||
`configure_sfr_intercube_multisip` reads
|
||||
the module's `TOPO_NAME_TO_KIND` plus cube dimensions to decide the
|
||||
SFR.
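
How a caller might read these symbols from a module object is sketched below. It is an illustration rather than `AhbmCCLBackend` code: `read_algorithm_contract` is a hypothetical helper, and a throwaway `types.ModuleType` stands in for a real file under `ccl/algorithms/`.

```python
import types


def read_algorithm_contract(module):
    """Fetch the symbols the backend expects from an algorithm module."""
    kernel = module.kernel            # required: AttributeError if missing
    kernel_args = module.kernel_args  # required: AttributeError if missing
    topo_map = getattr(module, "TOPO_NAME_TO_KIND", None)  # optional
    return kernel, kernel_args, topo_map


# Build a stand-in module instead of importing a real algorithm file.
fake = types.ModuleType("fake_algo")
fake.kernel_args = (
    lambda world_size, n_elem, *, cube_w=4, cube_h=4:
    (n_elem, cube_w, cube_h, world_size)
)
fake.kernel = lambda *args, tl=None: None

kernel, kernel_args, topo_map = read_algorithm_contract(fake)
args = kernel_args(2, 1024, cube_w=1, cube_h=1)
```

Using `getattr(..., None)` for `TOPO_NAME_TO_KIND` but plain attribute access for the other two makes the required/optional split visible at the point of failure.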
|
||||
|
||||
The current corpus has one algorithm module:
|
||||
`lrab_hierarchical_allreduce.py` (248 lines). The name expands to
|
||||
"**l**eft-**r**ight **a**lternating **b**roadcast hierarchical allreduce".
|
||||
When future modules like `ring_allreduce`, `tree_allreduce`, or
|
||||
`broadcast` are added, they must follow this contract for the backend's
|
||||
dispatch path to keep working.
|
||||
|
||||
Without an ADR-level contract:
|
||||
|
||||
- A new algorithm author has to infer the signature from ADR-0047 D3's
|
||||
one-liner.
|
||||
- The kernel-function argument order (especially `t_ptr, n_elem,
|
||||
cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w,
|
||||
sip_topo_h, tl`) is unclear without grep.
|
||||
- What `kernel_args` takes as inputs and what tuple it must return is a
  convention, not a documented contract.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. The algorithm module exposes four public symbols
|
||||
|
||||
```python
# src/kernbench/ccl/algorithms/<name>.py

from __future__ import annotations

# (required) topology-kind constants, referenced internally
SIP_TOPO_RING = 0
SIP_TOPO_TORUS = 1
SIP_TOPO_MESH = 2

# (optional) topology name → kind mapping. Used by the backend to
# translate ccl.yaml/topology's string SIP topology into an integer.
TOPO_NAME_TO_KIND = {
    "ring_1d": SIP_TOPO_RING,
    "torus_2d": SIP_TOPO_TORUS,
    "mesh_2d_no_wrap": SIP_TOPO_MESH,
}

# (required) kernel argument builder
def kernel_args(world_size: int, n_elem: int, *, cube_w: int = 4, cube_h: int = 4) -> tuple:
    return (n_elem, cube_w, cube_h, world_size)

# (required) kernel function (TLContext is injected via the `tl=...`
# keyword argument).
def my_allreduce_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
                        sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
    ...

# (required) kernel alias: the backend accesses `module.kernel`
kernel = my_allreduce_kernel
```
|
||||
|
||||
- The `kernel` alias is the entry point the backend invokes. Whatever
|
||||
the function name is (e.g., `allreduce_intercube_multidevice`), it
|
||||
must be exposed via `module.kernel = fn`.
|
||||
- Without `kernel_args`, the backend has no way to build the
|
||||
algorithm's argument list. See D2 for the signature.
|
||||
- If `TOPO_NAME_TO_KIND` is absent, the backend falls back to
|
||||
`sip_topo_kind = 0`. An algorithm supporting only a single topology
|
||||
may omit it.
|
||||
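The lookup-with-fallback behavior in the last bullet can be sketched as follows. This is a minimal illustration, not the backend's code: `resolve_topo_kind` and the stand-in modules built with `types.ModuleType` are hypothetical. Only the `getattr(..., "TOPO_NAME_TO_KIND", None)` access and the fallback to `0` come from this ADR; how an unknown name is treated when the dict exists is an assumption.

```python
import types


def resolve_topo_kind(algo_module, topo_name: str) -> int:
    """Hypothetical helper: map a topology name to the module's integer kind."""
    mapping = getattr(algo_module, "TOPO_NAME_TO_KIND", None)
    if mapping is None:
        # The module omitted the optional dict: fall back to kind 0.
        return 0
    # Assumption: a name missing from the dict also falls back to 0.
    return mapping.get(topo_name, 0)


# Stand-in algorithm modules (hypothetical, for illustration only).
multi_topo = types.ModuleType("multi_topo_algo")
multi_topo.TOPO_NAME_TO_KIND = {"ring_1d": 0, "torus_2d": 1, "mesh_2d_no_wrap": 2}
single_topo = types.ModuleType("single_topo_algo")  # no TOPO_NAME_TO_KIND

print(resolve_topo_kind(multi_topo, "torus_2d"))   # 1
print(resolve_topo_kind(single_topo, "torus_2d"))  # 0
```

A single-topology module therefore always sees `sip_topo_kind == 0`, which is why it can ignore the argument.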
|
||||
### D2. `kernel_args` signature — `(world_size, n_elem, *, cube_w, cube_h)`
|
||||
|
||||
```python
def kernel_args(world_size: int, n_elem: int, *,
                cube_w: int = 4, cube_h: int = 4) -> tuple:
    return (n_elem, cube_w, cube_h, world_size)
```
|
||||
|
||||
- **Positional arguments**: `world_size` (= number of ranks), `n_elem`
|
||||
(= element count of a single shard, f16-based).
|
||||
- **Keyword arguments**: `cube_w`, `cube_h` (= cube-mesh dimensions).
|
||||
Default 4×4 — aligned with `topology.yaml`'s `sip.cube_mesh` default.
|
||||
- **Return**: a tuple in the order the kernel's positional arguments
|
||||
expect.
|
||||
|
||||
When the backend calls `all_reduce`:
|
||||
|
||||
```python
kernel_args_tuple = self._algo_module.kernel_args(
    self._world_size, n_elem, cube_w=eff_cube_w, cube_h=eff_cube_h,
)
extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)
pending = self.ctx.launch(
    self._merged["algorithm"], kernel_fn, tensor,
    *kernel_args_tuple, *extra_args, _defer_wait=True,
)
```
|
||||
|
||||
So the kernel's full positional argument list becomes: `(tensor_ptr,
|
||||
*kernel_args_tuple, sip_rank, sip_topo_kind, sip_topo_w,
|
||||
sip_topo_h)`, with `tl=...` injected as a keyword. The tuple length
|
||||
and order returned by `kernel_args` must **match the kernel signature
|
||||
1:1**.
|
||||
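The 1:1 requirement can be checked mechanically before a kernel is ever launched. The sketch below is one possible check, not part of kernbench: `check_kernel_arity` is a hypothetical helper, and the sample values passed to `kernel_args` are arbitrary. It assumes `tl` is declared keyword-only, so it is not counted as a positional slot.

```python
import inspect


def check_kernel_arity(kernel, kernel_args, n_extra: int = 4) -> None:
    """Hypothetical check: the backend's positional args must fit the kernel."""
    positional_kinds = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    n_positional = sum(
        1 for p in inspect.signature(kernel).parameters.values()
        if p.kind in positional_kinds
    )
    built = kernel_args(2, 8, cube_w=4, cube_h=4)  # sample values only
    # tensor pointer + kernel_args tuple + the four sip_* extras
    expected = 1 + len(built) + n_extra
    if n_positional != expected:
        raise TypeError(
            f"kernel takes {n_positional} positional args, "
            f"backend would pass {expected}"
        )


def kernel_args(world_size, n_elem, *, cube_w=4, cube_h=4):
    return (n_elem, cube_w, cube_h, world_size)


def kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
           sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
    ...


check_kernel_arity(kernel, kernel_args)  # passes silently: 1 + 4 + 4 == 9
```

A check like this only validates the count. The order of the tuple still has to be verified against the kernel's parameter names by the author.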
|
||||
### D3. Kernel signature — standardized 9 + tl arguments
|
||||
|
||||
Recommended signature:
|
||||
|
||||
```python
def my_kernel(
    t_ptr: int,          # VA base of the row-wise-sharded tensor on this SIP
    n_elem: int,         # element count per cube tile (or per shard)
    cube_w: int,         # cube mesh width (from kernel_args)
    cube_h: int,         # cube mesh height (from kernel_args)
    n_sips: int,         # equal to world_size (rank = SIP, ADR-0024)
    sip_rank: int,       # this SIP's rank
    sip_topo_kind: int,  # result of TOPO_NAME_TO_KIND lookup
    sip_topo_w: int,     # SIP mesh width (0 for ring_1d)
    sip_topo_h: int,     # SIP mesh height (0 for ring_1d)
    *, tl,               # TLContext (auto-injected)
) -> None:
    ...  # kernel body (see D4)
```
|
||||
|
||||
Even if `kernel_args` chose a different positional argument order, the
|
||||
kernel's **last four positional arguments are always
|
||||
`(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`** — the backend
|
||||
appends them as `extra_args` (ADR-0047 D5). A custom algorithm must
|
||||
accept these four, but a single-SIP algorithm may simply ignore them.
|
||||
|
||||
`tl` is injected via keyword — `RuntimeContext.launch` adds `tl=tl_ctx`
|
||||
just before invoking the kernel. The signature therefore exposes `tl`
|
||||
as keyword-only (`*, tl`) or as the trailing keyword parameter.
|
||||
|
||||
### D4. Kernel body — freedom and constraints
|
||||
|
||||
Available inside the kernel: every `tl.*` primitive from ADR-0046 D3.
|
||||
|
||||
Common patterns:
|
||||
|
||||
- `cube_id = tl.program_id(axis=1)` — this PE's cube index.
|
||||
- `pe_addr = t_ptr + cube_id * nbytes` — per-cube VA of the tile.
|
||||
- `acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")` — load local
|
||||
data.
|
||||
- `tl.send(dir=...)` / `tl.recv(dir=..., shape=, dtype=)` — IPCQ
|
||||
collective.
|
||||
- `acc = acc + recv` — TensorHandle arithmetic operators (ADR-0046 D4).
|
||||
- `tl.store(pe_addr, acc)` — store the result.
|
||||
|
||||
The kernel body is plain Python — branching and loops are fine. But:
|
||||
|
||||
- No SimPy `yield` or `async` (ADR-0046 D1).
|
||||
- No direct access to TensorHandle `.data` — the Phase 1 timing model
|
||||
doesn't see data dependencies (ADR-0020's 2-pass separation).
|
||||
- Kernel execution must be deterministic — the same input must produce
|
||||
the same op sequence. No random or external IO.
|
||||
|
||||
### D5. SIP topology semantics — meaning of `sip_topo_kind`
|
||||
|
||||
The backend looks up `topology.yaml`'s `system.sips.topology` string
|
||||
in the algorithm module's `TOPO_NAME_TO_KIND` and passes the integer
|
||||
as `sip_topo_kind`. The algorithm then branches:
|
||||
|
||||
```python
if sip_topo_kind == SIP_TOPO_RING:
    acc = _inter_sip_ring(...)
elif sip_topo_kind == SIP_TOPO_TORUS:
    acc = _inter_sip_torus_2d(...)
elif sip_topo_kind == SIP_TOPO_MESH:
    acc = _inter_sip_mesh_2d(...)
```
|
||||
|
||||
Each topology branch communicates with peers via IPCQ direction names
|
||||
(`"global_E"`, `"W"`, `"S"`, `"N"` …). Direction semantics are defined
|
||||
in ADR-0023/0025; `configure_sfr_intercube_multisip` installs the IPCQ
|
||||
neighbor table accordingly.
|
||||
|
||||
If a topology kind not supported by the algorithm appears, prefer an
|
||||
explicit `raise ValueError(f"unsupported topology kind
|
||||
{sip_topo_kind}")` over a silent no-op — fail fast on misconfiguration.
|
||||
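A fail-fast dispatch along these lines might look like the sketch below. The three phase functions are placeholders invented for illustration (real ones would issue `tl.send` / `tl.recv`), and `inter_sip_phase` is a hypothetical wrapper. Only the constants and the `ValueError` message follow this ADR.

```python
SIP_TOPO_RING, SIP_TOPO_TORUS, SIP_TOPO_MESH = 0, 1, 2


# Placeholder phases (hypothetical): they only tag the value they receive.
def _inter_sip_ring(acc):
    return ("ring", acc)


def _inter_sip_torus_2d(acc):
    return ("torus_2d", acc)


def _inter_sip_mesh_2d(acc):
    return ("mesh_2d", acc)


def inter_sip_phase(acc, sip_topo_kind: int):
    """Dispatch on the topology kind; reject anything unrecognized."""
    if sip_topo_kind == SIP_TOPO_RING:
        return _inter_sip_ring(acc)
    if sip_topo_kind == SIP_TOPO_TORUS:
        return _inter_sip_torus_2d(acc)
    if sip_topo_kind == SIP_TOPO_MESH:
        return _inter_sip_mesh_2d(acc)
    # No silent no-op: a misconfigured topology stops here.
    raise ValueError(f"unsupported topology kind {sip_topo_kind}")


print(inter_sip_phase(1.0, SIP_TOPO_TORUS))  # ('torus_2d', 1.0)
```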
|
||||
### D6. The `ccl.yaml` algorithm entry
|
||||
|
||||
The algorithm module is paired with a `ccl.yaml` entry (ADR-0023 D10 +
|
||||
ADR-0047 D3):
|
||||
|
||||
```yaml
defaults:
  algorithm: lrab_hierarchical_allreduce
  n_elem: 8

algorithms:
  lrab_hierarchical_allreduce:
    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
    # optional: world_size override
    # optional: per-algorithm parameters consumed by configure_sfr_intercube_multisip
```
|
||||
|
||||
- `module`: the full Python module path; `importlib.import_module`
|
||||
consumes this string as-is.
|
||||
- `world_size` (optional): when set, overrides the topology fallback
|
||||
(ADR-0047 D2).
|
||||
- Algorithm-specific parameters are consumed by
|
||||
`configure_sfr_intercube_multisip`.
|
||||
|
||||
Workflow to add a new algorithm:
|
||||
|
||||
1. Write `src/kernbench/ccl/algorithms/<name>.py` following D1.
|
||||
2. Add the entry under `algorithms` in `ccl.yaml`.
|
||||
3. (If needed) extend `kernbench.ccl.sfr_config` with the SFR-install
|
||||
branch.
|
||||
4. Add tests (e.g., `tests/sccl/test_<name>.py`, extending the
|
||||
ADR-0043 eval harness).
|
||||
|
||||
### D7. Legacy "rank = flat PE index" mode
|
||||
|
||||
The `world_size` override in `ccl.yaml`, surfaced by ADR-0047 D2, is
|
||||
used by legacy "rank = flat PE index" tests. The algorithm module can
|
||||
assume `n_sips=world_size` ranks even in this mode — the backend
|
||||
maintains the rank↔(SIP, cube, PE) mapping, so no modal branching is
|
||||
needed inside the algorithm body.
|
||||
|
||||
In single-cube workloads (where `cube_w=cube_h=1`), the algorithm must
|
||||
skip mesh-based phases — see the
|
||||
`single_cube = (cube_w == 1 and cube_h == 1)` pattern in
|
||||
`lrab_hierarchical_allreduce.py`.
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Organize the algorithm module as a class (`class Allreduce: kernel(...)`)
|
||||
|
||||
Rejected. The Python module namespace already identifies an algorithm
|
||||
(see ADR-0047 D3's `importlib.import_module`). A class wrapper adds
|
||||
indirection without simplifying dispatch. Module-level free functions
|
||||
plus a `kernel` alias are clean and obvious.
|
||||
|
||||
### A2. Type `kernel_args` with an explicit dataclass
|
||||
|
||||
Rejected (currently). Each algorithm normally has a different argument
|
||||
count; forcing one dataclass would hurt cross-algorithm interchange.
|
||||
The tuple return is simple and unpacks cleanly with the backend's
|
||||
`*kernel_args_tuple`. If an algorithm wants stronger internal typing,
|
||||
it may define its own NamedTuple.
|
||||
|
||||
### A3. Move SFR installation inside the algorithm module
|
||||
|
||||
Rejected. SFR installation
|
||||
(`configure_sfr_intercube_multisip`) is a cross-module decision
|
||||
combining topology + algorithm; `kernbench.ccl.sfr_config` is a more
|
||||
natural home than the algorithm module itself. D6's "extend
|
||||
sfr_config if needed" workflow keeps responsibility boundaries clear.
|
||||
|
||||
### A4. Auto-register algorithm names via a decorator (analogous to ADR-0045's `@bench`)
|
||||
|
||||
Rejected. Unlike benches, algorithms are already tied to `ccl.yaml`
|
||||
entries; an additional registry would be redundant. The string mapping
|
||||
in `module` is sufficient.
|
||||
|
||||
## Consequences
|
||||
|
||||
- ADR-0047 D3's one-line contract expands to a D1–D7 author-facing
|
||||
guide; new algorithm signatures no longer need to be grep-derived.
|
||||
- D3's standardized 9 + tl signature couples naturally with the
|
||||
backend's `extra_args` append (ADR-0047 D5). It is explicit that
|
||||
even single-SIP-only algorithms must accept the four `sip_*` trailing
|
||||
arguments.
|
||||
- With D5's fail-fast recommendation, a configured SIP topology that
  the algorithm doesn't support surfaces as an explicit `ValueError`
  rather than a silent wrong result.
|
||||
- D6's step-by-step addition workflow makes clear how far a new
|
||||
algorithm has to reach into sfr_config / tests / ccl.yaml.
|
||||
@@ -0,0 +1,288 @@
|
||||
# ADR-0051: Routing Helper API — `AddressResolver` + `PathRouter`
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-22).
|
||||
|
||||
Pins down every public API, argument, return value, and adjacency-graph
|
||||
selection of the two helper classes (`AddressResolver`, `PathRouter`)
|
||||
exposed by `policy/routing/router.py`. ADR-0002 defines routing
|
||||
distance, ordering, and bypass rules, but **the helper API surface
|
||||
itself** has had no ADR-level coverage.
|
||||
|
||||
## First action
|
||||
|
||||
### `AddressResolver(graph)`
|
||||
|
||||
On construction, caches two pieces of state:
|
||||
|
||||
1. `self._node_ids = set(graph.nodes)` — a set of all node ids for
|
||||
lookup.
|
||||
2. `self._hbm_slice_bytes = hbm_total_gb * (1 << 30) // slices_per_cube`
|
||||
— derived from `graph.spec.cube.memory_map` (default `48 GB / 8
|
||||
slices = 6 GB`). `resolve()` uses this value to decode `pe_id` from
|
||||
an HBM PA's `hbm_offset`.
|
||||
|
||||
In short, **AddressResolver's first act is "precompute the full set of
|
||||
node ids and the HBM slice size"**. It does not retain the graph
|
||||
itself.
|
||||
|
||||
### `PathRouter(graph)`
|
||||
|
||||
On construction, **builds four separate adjacency graphs in one pass**:
|
||||
|
||||
1. `self._adj_all`: every edge (used for component-to-component
|
||||
routing).
|
||||
2. `self._adj`: edges with `kind != "command"` (PE DMA / generic data
|
||||
paths).
|
||||
3. `self._adj_mcpu_dma`: excludes
|
||||
`_MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"}` (M_CPU DMA
|
||||
must not pass through PE pipeline nodes).
|
||||
4. `self._adj_local`: excludes the 8-element `_UCIE_KINDS` set (UCIe
|
||||
would look like a zero-distance bus to Dijkstra, which would prefer
|
||||
it over the mesh — for cube-local routing this must be avoided).
|
||||
|
||||
Each graph is a `defaultdict(list)` of `(neighbor, weight)`. The
|
||||
weight is `edge.routing_weight_mm or edge.distance_mm`.
|
||||
|
||||
In short, **PathRouter's first act is "classify topology edges into
|
||||
four policy-specific adjacency lists simultaneously"**. Each `find_*()`
|
||||
call picks the appropriate graph and runs Dijkstra.
|
||||
|
||||
## Context
|
||||
|
||||
`policy/routing/router.py` performs two responsibilities together:
|
||||
|
||||
- **Naming**: it is the sole owner of the topology naming convention
|
||||
(`sip{S}.cube{C}.<comp>`, `sip{S}.io{I}.pcie_ep`, etc.). Components /
|
||||
probe / IPCQ install / runtime API do not build node-id strings
|
||||
themselves — they call helpers.
|
||||
- **Path decisions**: policy separation by `edge.kind`. For the same
|
||||
src→dst, different routing intents (PE DMA vs M_CPU DMA vs general
|
||||
component routing) call for different adjacencies and so produce
|
||||
different paths.
|
||||
|
||||
This helper API is widely consumed (probe.py / distributed.py /
|
||||
install.py / various components / tests), yet **the exact signatures /
|
||||
return semantics / adjacency picks** are not gathered in any ADR. This
|
||||
ADR closes that gap.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. `AddressResolver` exposes five public methods
|
||||
|
||||
#### D1.1. `resolve(addr: PhysAddr) -> str`
|
||||
|
||||
Translates a `PhysAddr` to a destination node id in the topology:
|
||||
|
||||
```
addr.kind == "hbm"  → f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
    where pe_id = addr.hbm_offset // self._hbm_slice_bytes   (ADR-0017 D4/D9)

addr.kind == "pe_resource":
    addr.unit_type == PE    → f"sip{s}.cube{d}.pe{addr.pe_id}.pe_tcm"
    addr.unit_type == SRAM  → f"sip{s}.cube{d}.sram"
    addr.unit_type == MCPU  → f"sip{s}.cube{d}.m_cpu"
    others                  → RoutingError("unsupported unit_type")

other kinds → RoutingError("unsupported address kind")
```
|
||||
|
||||
If the derived node id is not in `self._node_ids`, `resolve()` raises
`RoutingError(f"node {node_id} not found in topology")`. So even when
the address is syntactically valid, a node missing from the topology
fails loudly.
|
||||
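The HBM branch, including the slice-size arithmetic from the constructor, can be sketched as below. This is an illustration under assumptions: `PhysAddr` here is a trimmed-down stand-in whose field names (`sip`, `cube`) are hypothetical, `RoutingError` is a local stand-in, and `resolve_hbm` is not the real method. The 48 GB / 8 slices default and the node-id format come from this ADR.

```python
from dataclasses import dataclass


class RoutingError(Exception):
    """Stand-in for the router's error type."""


@dataclass(frozen=True)
class PhysAddr:
    # Hypothetical, trimmed-down address type; the real fields may differ.
    kind: str
    sip: int
    cube: int
    hbm_offset: int = 0


HBM_TOTAL_GB = 48
SLICES_PER_CUBE = 8
HBM_SLICE_BYTES = HBM_TOTAL_GB * (1 << 30) // SLICES_PER_CUBE  # 6 GiB per slice


def resolve_hbm(addr: PhysAddr, node_ids: set) -> str:
    """Decode the owning PE slice from the offset, then validate the node id."""
    if addr.kind != "hbm":
        raise RoutingError("unsupported address kind")
    pe_id = addr.hbm_offset // HBM_SLICE_BYTES
    node_id = f"sip{addr.sip}.cube{addr.cube}.hbm_ctrl.pe{pe_id}"
    if node_id not in node_ids:
        raise RoutingError(f"node {node_id} not found in topology")
    return node_id


nodes = {f"sip0.cube0.hbm_ctrl.pe{i}" for i in range(SLICES_PER_CUBE)}
addr = PhysAddr(kind="hbm", sip=0, cube=0, hbm_offset=13 * (1 << 30))
print(resolve_hbm(addr, nodes))  # sip0.cube0.hbm_ctrl.pe2
```

An offset of 13 GiB lands in slice 2 because slices cover 6 GiB each (12 GiB to 18 GiB for slice 2).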
|
||||
#### D1.2. `find_m_cpu(sip, cube) -> str`
|
||||
|
||||
Returns `f"sip{sip}.cube{cube}.m_cpu"`; absent → `RoutingError`.
|
||||
|
||||
#### D1.3. `find_pcie_ep(sip, io_id="io0") -> str`
|
||||
|
||||
Returns `f"sip{sip}.{io_id}.pcie_ep"`; absent → `RoutingError`.
|
||||
|
||||
#### D1.4. `find_io_cpu(sip, io_id="io0") -> str`
|
||||
|
||||
Returns `f"sip{sip}.{io_id}.io_cpu"`; absent → `RoutingError`.
|
||||
|
||||
#### D1.5. `find_all_pcie_eps() -> list[str]`
|
||||
|
||||
All PCIE_EP node ids across all SIPs, sorted. Filtered by
|
||||
`endswith(".pcie_ep")`. Cross-SIP IPCQ uses this when enumerating
|
||||
PCIE_EPs.
|
||||
|
||||
This class is the sole owner of the naming convention
|
||||
(`sip{S}.cube{C}.<comp>`, `sip{S}.{io_id}.<comp>`) — ADR-0015 D4.
|
||||
The topology builder produces nodes with the same naming convention;
|
||||
components never build node-id strings directly — they go through
|
||||
these helpers.
|
||||
|
||||
### D2. `PathRouter`'s four adjacency graphs
|
||||
|
||||
Constructed in one pass. `edge.kind` drives policy:
|
||||
|
||||
| graph | excluded edge kinds | use case |
|---|---|---|
| `_adj_all` | (none) | M_CPU↔NOC command included, IO_CPU/M_CPU routes |
| `_adj` | `"command"` | PE DMA / generic data paths |
| `_adj_mcpu_dma` | `"pe_internal"`, `"pe_to_router"` | M_CPU DMA (skips PE pipeline) |
| `_adj_local` | `_UCIE_KINDS` (`ucie_internal`, `ucie_conn_to_router`, `router_to_ucie_conn`, `ucie_conn_to_noc`, `noc_to_ucie_conn`, `ucie_mesh`, `io_to_cube`, `cube_to_io`) | same-cube routing (UCIe bus excluded) |
|
||||
|
||||
Each graph is `dict[node_id, list[(neighbor, weight)]]` with weight =
|
||||
`edge.routing_weight_mm or edge.distance_mm`. Excluding command edges
|
||||
prevents them from influencing routing; isolating `_adj_local` keeps
|
||||
UCIe's "zero-distance bus" from out-competing the mesh — consistent
|
||||
with ADR-0017 D7's cross-PE-slice mesh-distance requirement.
|
||||
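The one-pass classification can be sketched as follows. The `Edge` dataclass and `build_adjacency` are hypothetical stand-ins for the topology's edge objects and the `PathRouter` constructor, and edges are assumed to be directed. The exclusion sets and the weight expression follow the table and text above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Edge:
    # Hypothetical edge record; field names follow the ADR text.
    src: str
    dst: str
    kind: str
    distance_mm: float
    routing_weight_mm: Optional[float] = None


MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"}
UCIE_KINDS = {
    "ucie_internal", "ucie_conn_to_router", "router_to_ucie_conn",
    "ucie_conn_to_noc", "noc_to_ucie_conn", "ucie_mesh",
    "io_to_cube", "cube_to_io",
}


def build_adjacency(edges):
    """Classify every edge into the four policy graphs in a single loop."""
    adj_all, adj, adj_mcpu_dma, adj_local = (defaultdict(list) for _ in range(4))
    for e in edges:
        weight = e.routing_weight_mm or e.distance_mm
        entry = (e.dst, weight)
        adj_all[e.src].append(entry)
        if e.kind != "command":
            adj[e.src].append(entry)
        if e.kind not in MCPU_DMA_EXCLUDE:
            adj_mcpu_dma[e.src].append(entry)
        if e.kind not in UCIE_KINDS:
            adj_local[e.src].append(entry)
    return adj_all, adj, adj_mcpu_dma, adj_local


edges = [
    Edge("a", "b", "command", 1.0),
    Edge("a", "c", "ucie_mesh", 5.0, routing_weight_mm=0.5),
    Edge("a", "d", "pe_internal", 2.0),
]
adj_all, adj, adj_mcpu_dma, adj_local = build_adjacency(edges)
print(adj_local["a"])  # [('b', 1.0), ('d', 2.0)]
```

One Python detail worth keeping in mind: because the weight uses `or`, an explicit `routing_weight_mm` of `0.0` is falsy and falls back to `distance_mm`.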
|
||||
### D3. `PathRouter` exposes five public methods (+ two backward-compat shims)
|
||||
|
||||
#### D3.1. `find_path(src_pe: str, dst_node: str) -> list[str]`
|
||||
|
||||
**PE DMA routing**. `src_pe` is a PE prefix (e.g.,
|
||||
`"sip0.cube0.pe0"`); the function auto-prepends `.pe_dma`, making the
|
||||
true start node `"sip0.cube0.pe0.pe_dma"`.
|
||||
|
||||
Adjacency depends on cube-locality (`_same_cube`):
|
||||
|
||||
- **Same-cube** (src and dst share `sip{S}.cube{C}.` prefix): uses
|
||||
`_adj_local`. Excluding UCIe lets cross-PE-slice access pay accurate
|
||||
mesh distance (ADR-0017 D7).
|
||||
- **Cross-cube**: uses `_adj`. UCIe naturally becomes the right choice
|
||||
for the cross-cube portion.
|
||||
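The cube-locality test behind this branch can be sketched as a prefix comparison on node ids. Both functions below are hypothetical (the real `_same_cube` is not shown in this ADR); the sketch assumes only the `sip{S}.cube{C}.<comp>` naming convention.

```python
def cube_prefix(node_id: str):
    """Return 'sip{S}.cube{C}' for cube-scoped ids, else None (e.g. IO nodes)."""
    parts = node_id.split(".")
    if len(parts) >= 2 and parts[0].startswith("sip") and parts[1].startswith("cube"):
        return ".".join(parts[:2])
    return None


def same_cube(src: str, dst: str) -> bool:
    """True only when both ids carry the same sip/cube prefix."""
    prefix = cube_prefix(src)
    return prefix is not None and prefix == cube_prefix(dst)


print(same_cube("sip0.cube0.pe0.pe_dma", "sip0.cube0.hbm_ctrl.pe3"))  # True
print(same_cube("sip0.cube1.pe0.pe_dma", "sip0.cube10.hbm_ctrl.pe3"))  # False
print(same_cube("sip0.cube0.pe0.pe_dma", "sip0.io0.pcie_ep"))  # False
```

Comparing whole dot-separated components, rather than using a raw `str.startswith` on the id, keeps `cube1` from matching `cube10`.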
|
||||
#### D3.2. `find_path_with_distance(src_pe, dst_node) -> tuple[list[str], float]`
|
||||
|
||||
Same adjacency policy as D3.1, but returns `(path, total_distance)`.
|
||||
Used by probe and analysis tools that need the distance metric.
|
||||
|
||||
#### D3.3. `find_mcpu_dma_path(m_cpu_id: str, dst_hbm_id: str) -> list[str]`
|
||||
|
||||
**M_CPU DMA path**. Same cube → `_adj_local` (stay within the mesh);
|
||||
different cube → `_adj_all` (cross via UCIe). The
|
||||
`_MCPU_DMA_EXCLUDE` set ensures PE-pipeline nodes never appear on
|
||||
M_CPU's routes.
|
||||
|
||||
#### D3.4. `find_memory_path(src: str, dst: str) -> list[str]`
|
||||
|
||||
Direct memory path like
|
||||
`pcie_ep → io_noc → cube → router mesh → hbm_ctrl`. Uses
|
||||
`_adj_mcpu_dma` to exclude `pe_internal` and `pe_to_router`, so
|
||||
host-issued reads/writes never leak into the PE pipeline. Probe
|
||||
(ADR-0049 D1's H2D/D2H cases) calls this directly.
|
||||
|
||||
#### D3.5. `find_node_path(src: str, dst: str) -> list[str]`
|
||||
|
||||
Generic routing between arbitrary nodes, **including command edges**
|
||||
(via `_adj_all`). IoCpuComponent / MCpuComponent use this when they
|
||||
need to route through M_CPU ↔ NOC command-kind links.
|
||||
|
||||
#### D3.6. Backward-compat shims
|
||||
|
||||
- `_dijkstra(start, goal) -> list[str]` — thin wrapper for
|
||||
`_run_dijkstra(self._adj, …)`.
|
||||
- `_dijkstra_with_dist(start, goal) -> tuple[list[str], float]` —
|
||||
distance-aware variant.
|
||||
|
||||
Despite the underscore prefixes (suggesting internal API), existing
|
||||
tests call these directly. New code should prefer D3.1–D3.5; these two
|
||||
shims are deprecation candidates.
|
||||
|
||||
### D4. Dijkstra — single-source shortest path
|
||||
|
||||
`_run_dijkstra_with_dist(adj, start, goal)`:
|
||||
|
||||
- `heapq` priority queue.
|
||||
- `best: dict[node, distance]` — best known distance to each node.
|
||||
- `prev: dict[node, predecessor]` — for path reconstruction.
|
||||
- Edge weight = `routing_weight_mm or distance_mm`. The separation
|
||||
matters because UCIe (and a few others) declare an explicit
|
||||
`routing_weight_mm` distinct from physical `distance_mm`.
|
||||
|
||||
`start == goal` short-circuits to `([start], 0.0)`. Unreachable target
|
||||
→ `RoutingError(f"no path from {start} to {goal}")`.
|
||||
|
||||
The algorithm is **deterministic**: identical graph + start/goal gives
|
||||
the same path, satisfying SPEC R1 ("routing MUST be deterministic").
|
||||
Tie-breaks follow `heapq`'s push order (Python list order is
|
||||
deterministic).
|
||||
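A compact version of this search is sketched below. It is not the router's implementation: `run_dijkstra_with_dist` and the local `RoutingError` are stand-ins, and the explicit push counter used for tie-breaking is an assumption added here to make the ordering of equal-distance entries unambiguous.

```python
import heapq
from itertools import count


class RoutingError(Exception):
    """Stand-in for the router's error type."""


def run_dijkstra_with_dist(adj, start, goal):
    """Return (path, total_distance) over an adjacency dict of (neighbor, weight)."""
    if start == goal:
        return [start], 0.0
    best = {start: 0.0}   # best known distance per node
    prev = {}             # predecessor per node, for path reconstruction
    tie = count()         # assumption: push counter breaks distance ties
    heap = [(0.0, next(tie), start)]
    while heap:
        dist, _, node = heapq.heappop(heap)
        if node == goal:
            path = [node]
            while node in prev:
                node = prev[node]
                path.append(node)
            return path[::-1], dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for neighbor, weight in adj.get(node, ()):
            cand = dist + weight
            if cand < best.get(neighbor, float("inf")):
                best[neighbor] = cand
                prev[neighbor] = node
                heapq.heappush(heap, (cand, next(tie), neighbor))
    raise RoutingError(f"no path from {start} to {goal}")


adj = {"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 1.0)]}
print(run_dijkstra_with_dist(adj, "a", "c"))  # (['a', 'b', 'c'], 2.0)
```

The two-hop route wins over the direct edge because weights, not hop counts, are minimized, which is the property A3 relies on when rejecting BFS.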
|
||||
### D5. Single-owner principle for helper-API decisions
|
||||
|
||||
The following decisions live only inside router.py:
|
||||
|
||||
- Naming convention: `sip{S}.cube{C}.<comp>`,
|
||||
`sip{S}.{io_id}.<comp>`,
|
||||
`sip{S}.cube{C}.hbm_ctrl.pe{pe_id}`.
|
||||
- Adjacency policy: which edge kinds belong to which graph.
|
||||
- Algorithm for recovering the PE id from an HBM offset (via the slice size).
|
||||
- Dijkstra weight selection
|
||||
(`routing_weight_mm or distance_mm`).
|
||||
|
||||
Breaking single ownership (e.g., a component starting to build
|
||||
`f"sip{s}..."` itself) would explode the blast radius of naming-
|
||||
convention changes. This aligns with ADR-0015 D4.
|
||||
|
||||
### D6. Consumers of the helper API
|
||||
|
||||
Methods listed in this ADR are called from (current corpus):
|
||||
|
||||
- `probes/probe.py` (ADR-0049): `find_pcie_ep`, `find_io_cpu`,
|
||||
`find_m_cpu`, `find_node_path`, `find_mcpu_dma_path`,
|
||||
`find_memory_path`, `find_path`, `resolve`.
|
||||
- `runtime_api/distributed.py` (ADR-0047): indirectly (engine-internal
|
||||
routing).
|
||||
- `ccl/install.py` (ADR-0023): `find_all_pcie_eps`, `resolve`.
|
||||
- `sim_engine/event_log.py`: like probe — `find_pcie_ep`,
|
||||
`find_memory_path`.
|
||||
- `components/builtin/m_cpu.py`, `components/builtin/io_cpu.py`:
|
||||
`find_node_path`, `find_mcpu_dma_path`.
|
||||
- Tests (test_routing.py, test_cross_sip_routing.py, …): most of
|
||||
D3.1–D3.5.
|
||||
|
||||
When a new consumer arrives, D1/D3 act as a first-pass guide on
|
||||
whether an existing method matches the intent or a new one is needed.
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. One adjacency graph + per-call edge-kind filtering
|
||||
|
||||
Rejected. Re-filtering the graph on every `find_*()` call hurts
|
||||
Dijkstra cache locality. Constructing four graphs in one pass (D2)
|
||||
has modest memory cost (edges ≤ a few × 10⁴), and selection happens
|
||||
in O(1) at call time.
|
||||
|
||||
### A2. Drive adjacency separation by separate edge metadata rather than `kind`
|
||||
|
||||
Rejected. `edge.kind` is already assigned by the topology builder
|
||||
(ADR-0015 D4 + ADR-0017); a parallel metadata field would force
|
||||
synchronization between two systems.
|
||||
|
||||
### A3. Use BFS with uniform weights instead of Dijkstra
|
||||
|
||||
Rejected. With per-edge `routing_weight_mm` (mesh link / UCIe /
|
||||
IO-internal), BFS minimizes hop count rather than total
|
||||
latency/distance. SPEC R1 + R2 require deterministic and accurate
|
||||
routing, which BFS does not deliver.
|
||||
|
||||
### A4. Express the helper API as module functions instead of classes
|
||||
|
||||
Rejected. Each class
|
||||
(`AddressResolver`, `PathRouter`) maintains caches
|
||||
(`_node_ids`, `_hbm_slice_bytes`, four adjacency graphs) reused across
|
||||
many routing queries on the same graph. Module functions would have
|
||||
to rebuild state per call or go global, hurting safety and
|
||||
performance.
|
||||
|
||||
## Consequences
|
||||
|
||||
- When components / probe / IPCQ install / runtime API all go through
|
||||
router.py helpers, a naming-convention change (e.g., `.io0.` →
|
||||
`.iochiplet0.`) is a one-file edit (D5).
|
||||
- D2's four-graph split is now ADR-locked, so when a new edge kind is
|
||||
added (e.g., a new inter-die UCIe-link kind), the right adjacency
|
||||
category is decided explicitly rather than by default.
|
||||
- D3.1's same-cube vs cross-cube branching (ADR-0017 D7) is explicit,
|
||||
so anyone changing routing knows which adjacency to touch.
|
||||
- D6's consumer list bounds PR-review scope for helper-API changes,
|
||||
and the backward-compat shims (D3.6) are flagged as deprecation
|
||||
candidates.
|
||||
@@ -0,0 +1,371 @@
|
||||
# ADR-0052: OpLog + MemoryStore Schemas — sim_engine internals
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-22).
|
||||
|
||||
Pins down the `OpRecord` schema and the `record_start` / `record_end` /
|
||||
`record_copy` behavior in `sim_engine/op_log.py`, plus the
|
||||
(space, addr) namespace and read/write semantics of `MemoryStore` in
|
||||
`sim_engine/memory_store.py`. ADR-0020 (2-pass data execution) declares
|
||||
that these two facilities exist, but **the precise record fields and
|
||||
semantics** had no ADR-level coverage, and several recent ADRs
|
||||
(ADR-0046 D3.2's `tl.store` visibility, ADR-0023 D9's IPCQ copy
|
||||
record) depend on these semantics.
|
||||
|
||||
## First action
|
||||
|
||||
### `OpLogger(memory_store=None)`
|
||||
|
||||
On construction, initialize three fields:
|
||||
|
||||
1. `self._records: list[OpRecord] = []` — accumulated records.
|
||||
2. `self._pending: dict[int, dict] = {}` — partial records keyed by
|
||||
`id(msg)` (created at `record_start`, completed at `record_end`).
|
||||
3. `self._memory_store = memory_store` — optional MemoryStore
|
||||
reference. Used to capture math-op input snapshots and dma_write
|
||||
HBM-source snapshots.
|
||||
|
||||
Records and pending are empty; the `record_*` calls accumulate data
|
||||
over time.
|
||||
|
||||
### `MemoryStore()`
|
||||
|
||||
On construction, initialize a single field:
|
||||
`self._storage: dict[str, dict[int, np.ndarray]] = {}` — a two-level
|
||||
dict (`space → addr → ndarray`). Inner dicts are created lazily as new
|
||||
spaces appear.
|
||||
|
||||
In short, **both facilities start empty: `OpLogger` sets up empty
accumulators, and `MemoryStore` sets up a sparse, per-space dict**. The
first record / write fills them when it arrives.
|
||||
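The lazy two-level layout of `MemoryStore` can be sketched as below. This is an illustration only: the `write` / `read` signatures are hypothetical (the real ones likely also take shape and dtype), and plain Python lists stand in for `np.ndarray` payloads so the sketch has no third-party dependency.

```python
class MemoryStore:
    """Sketch of a sparse (space, addr) store; not the real class."""

    def __init__(self):
        # space -> addr -> payload; starts with no spaces at all
        self._storage = {}

    def write(self, space, addr, data):
        # The inner dict for a space is created on first write to it.
        self._storage.setdefault(space, {})[addr] = data

    def read(self, space, addr):
        # Assumption: a missing (space, addr) raises KeyError.
        return self._storage[space][addr]


store = MemoryStore()
print(store._storage)                    # {}
store.write("hbm", 0x1000, [1.0, 2.0])  # list stands in for np.ndarray
print(sorted(store._storage))            # ['hbm']
print(store.read("hbm", 0x1000))         # [1.0, 2.0]
```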
|
||||
## Context
|
||||
|
||||
ADR-0020 D2/D5/D7 (2-pass data execution) declares:
|
||||
|
||||
- During Phase 1 (timing), `ComponentBase._on_process_start/end` hooks
|
||||
call `OpLogger.record_start/end`, recording the time and metadata of
|
||||
every data op.
|
||||
- Phase 2 (data) replays the op log in `t_start` order to compute real
|
||||
data.
|
||||
- Data payloads live in `MemoryStore`, keyed by (space, addr).
|
||||
|
||||
Subsequent ADRs (ADR-0023 D9's IPCQ atomic write, ADR-0027's Megatron
|
||||
TP scratch-overwrite avoidance, ADR-0046 D3.2's `tl.store` visibility)
|
||||
depend on op_log and MemoryStore behavior, but **the exact record
|
||||
fields / space names / snapshot timing** are only discoverable via
|
||||
source grep. This ADR codifies them.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. `OpRecord` schema — seven fields
|
||||
|
||||
```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component_id: str
    op_kind: str              # "memory" | "gemm" | "math" | "unknown"
    op_name: str              # e.g. "dma_read", "gemm_f16", "exp",
                              # "TileToken/DMA_READ", "composite_gemm",
                              # "ipcq_copy"
    params: dict[str, Any]
    dependency_ids: list[int] = field(default_factory=list)
```
|
||||
|
||||
- **`t_start` / `t_end`**: SimPy time (float ns). `t_start` is when the
|
||||
component begins the op; `t_end` is completion. Duration =
|
||||
`t_end - t_start`.
|
||||
- **`component_id`**: the node id where the op occurred (e.g.,
|
||||
`"sip0.cube0.pe0.pe_dma"`).
|
||||
- **`op_kind`**: one of four. Phase 2 DataExecutor branches on this.
|
||||
- **`op_name`**: a debug/analysis-friendly name. For a TileToken,
|
||||
expands to `"TileToken/{stage_type}"` (e.g.,
|
||||
`"TileToken/DMA_READ"`) to disambiguate stages.
|
||||
- **`params`**: op-specific metadata dict (see D3).
|
||||
- **`dependency_ids`**: currently unused (default `[]`). Reserved for
|
||||
future cross-op dependency tracking.
|
||||
|
||||
### D2. `OpLogger.records` — guaranteed `t_start` sort
|
||||
|
||||
```python
@property
def records(self) -> list[OpRecord]:
    self._records.sort(key=lambda r: r.t_start)
    return self._records
```
|
||||
|
||||
A stable sort by `t_start` runs on each access. Records with the same
|
||||
`t_start` preserve insertion order. Aligns with ADR-0020 D5's
|
||||
"t_start stable ordering" requirement.
|
||||
|
||||
Phase 2 DataExecutor always accesses via the `records` property, so
|
||||
even when `record_end` calls arrive out of `t_start` order (e.g., a
|
||||
short op started later but finished earlier), the sequence handed to
|
||||
Phase 2 is consistent.
|
||||
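The interplay between `_pending`, `record_end`, and the sorted `records` property can be sketched as follows. The `record_start` / `record_end` signatures here are simplified assumptions (the real hooks derive op metadata from the message), and the op values are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component_id: str
    op_kind: str
    op_name: str
    params: dict[str, Any]
    dependency_ids: list[int] = field(default_factory=list)


class OpLogger:
    """Sketch: partial records keyed by id(msg), completed at record_end."""

    def __init__(self):
        self._records = []
        self._pending = {}

    def record_start(self, msg, t_start, component_id, op_kind, op_name, params):
        # Hypothetical signature: metadata is passed in explicitly here.
        self._pending[id(msg)] = dict(
            t_start=t_start, component_id=component_id,
            op_kind=op_kind, op_name=op_name, params=params,
        )

    def record_end(self, msg, t_end):
        partial = self._pending.pop(id(msg))
        self._records.append(OpRecord(t_end=t_end, **partial))

    @property
    def records(self):
        self._records.sort(key=lambda r: r.t_start)  # list.sort is stable
        return self._records


log = OpLogger()
long_op, short_op = object(), object()
log.record_start(long_op, 0.0, "sip0.cube0.pe0.pe_dma", "memory", "dma_read", {})
log.record_start(short_op, 2.0, "sip0.cube0.pe0", "math", "exp", {})
log.record_end(short_op, 3.0)   # finishes first
log.record_end(long_op, 10.0)
print([r.op_name for r in log.records])  # ['dma_read', 'exp']
```

Although the short op is appended first, the property hands Phase 2 the records in `t_start` order.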
|
||||
### D3. `params` schema per `op_name` (matrix from `_extract_op_info`)
|
||||
|
||||
#### D3.1. `op_kind="memory", op_name="dma_read"` (DmaReadCmd)
|
||||
|
||||
```python
{"src_addr": int, "nbytes": int, "handle_id": str}
```
|
||||
|
||||
#### D3.2. `op_kind="memory", op_name="dma_write"` (DmaWriteCmd)
|
||||
|
||||
```python
{
    "src_space": str,    # handle.space ("tcm"|"hbm"|"sram"), default "tcm"
    "src_addr": int,     # handle.addr
    "shape": tuple, "dtype": str,
    "dst_space": "hbm",  # DmaWrite always targets HBM
    "dst_addr": int,
    "nbytes": int,
    "handle_id": str,
    # When src_space == "hbm" at record_end, a snapshot is added (D4)
    "snapshot": np.ndarray | None,
}
```
|
||||
|
||||
#### D3.3. `op_kind="gemm", op_name=f"gemm_{dtype_a}"` (GemmCmd)
|
||||
|
||||
```python
{
    "src_a_addr": int, "src_b_addr": int, "dst_addr": int,
    "shape_a": tuple, "shape_b": tuple, "shape_out": tuple,
    "dtype_in": str, "dtype_out": str,
    "m": int, "k": int, "n": int,
    # ADR-0027: per-operand + output spaces preserved
    "src_a_space": str, "src_b_space": str, "dst_space": str,
}
```
|
||||
|
||||
#### D3.4. `op_kind="math", op_name=msg.op` (MathCmd; op = "exp", "sum", "add", "where", …)
|
||||
|
||||
```python
{
    "input_addrs": list[int],    # addrs of input handles
    "input_shapes": list[tuple],
    "input_spaces": list[str],
    "input_dtypes": list[str],
    "dst_addr": int, "dst_space": str,
    "shape_out": tuple, "dtype": str,
    "axis": int | None,          # only meaningful for reductions
    # All inputs get snapshots at record_end (D4)
    "input_snapshots": list[np.ndarray | None],
}
```
|
||||
|
||||
#### D3.5. `op_kind="gemm" or "math", op_name=f"composite_{op}"` (CompositeCmd)
|
||||
|
||||
```python
{
    "op": str,                            # "gemm" | "math"
    "out_addr": int, "out_nbytes": int,
    # If op == "gemm", same fields as GemmCmd are added:
    "src_a_addr": int, "src_b_addr": int,
    "shape_a": tuple, "shape_b": tuple,
    "dtype_in": str, "dtype_out": str,
    "src_a_space": str, "src_b_space": str,
    "dst_space": "hbm", "dst_addr": int,  # = out_addr
}
```
|
||||
|
||||
If `op == "gemm"`, `op_kind = "gemm"`; otherwise `"math"`. This aliasing
lets Phase 2 replay composite-gemm on the same path as `GemmCmd`.
|
||||
|
||||
#### D3.6. `op_kind="memory", op_name="ipcq_copy"` (record_copy path)
|
||||
|
||||
```python
|
||||
{
|
||||
"src_space": str, "src_addr": int,
|
||||
"dst_space": str, "dst_addr": int,
|
||||
"shape": tuple, "dtype": str, "nbytes": int,
|
||||
"snapshot": np.ndarray | None, # passed by caller; if None, record_copy reads fresh
|
||||
}
|
||||
```
|
||||
|
||||
`PE_DMA._handle_ipcq_inbound` (ADR-0023 D9) emits this record so Phase
|
||||
2 can replay the IPCQ slot's inbound copy. It bypasses
|
||||
`record_start` / `record_end` and pushes directly via `record_copy()`.
|
||||
|
||||
#### D3.7. `op_kind="unknown", op_name=type(msg).__name__`
|
||||
|
||||
Fallback for messages `_extract_op_info` doesn't recognize. `params =
|
||||
{}`. If DataExecutor encounters this kind, it skips — Phase 2 replay
|
||||
is unaffected.
|
||||
|
||||
### D4. Snapshot capture timing
|
||||
|
||||
When `OpLogger._memory_store` is set, `record_end` performs:
|
||||
|
||||
- **Math op**: read every input
|
||||
(addr/shape/space/dtype) from `self._memory_store.read(...)` and
|
||||
attach an ndarray copy to `params["input_snapshots"]`. Read failure
|
||||
→ `None`.
|
||||
- **`dma_write` op**: snapshot the source **only if `src_space ==
|
||||
"hbm"`** and attach to `params["snapshot"]`. TCM (PE scratch)
|
||||
sources are **deliberately skipped** — TCM is repopulated by Phase 2
|
||||
math/gemm replay, and a Phase-1-time snapshot would capture a
|
||||
previous kernel's stale value (ADR-0027 postmortem: TP gemm →
|
||||
all_reduce race).
|
||||
- **`ipcq_copy`**: the caller passes the in-flight snapshot via
|
||||
`snapshot=token.data`. If absent, `record_copy` attempts a fresh
|
||||
read from MemoryStore.
|
||||
|
||||
Snapshots are taken with `.copy()` (fresh allocation), making them
|
||||
safe against later storage mutation. This is the foundation of
|
||||
ADR-0027's "cross-PE Phase 2 ordering" race-avoidance.
|
||||
|
||||
When `memory_store` is `None` (Phase 1 timing-only mode), all
|
||||
snapshot steps are skipped. Only the timing portion of the record is
|
||||
preserved; data replay is unavailable.
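
The capture policy described in D4 can be sketched as a small decision function. This is only an illustration under stated assumptions: `capture_snapshots` and the dict-backed `store` are hypothetical stand-ins, plain lists stand in for ndarrays, and the `src_addr` field on the `dma_write` record is assumed. The real `OpLogger.record_end` may be structured differently.

```python
import copy


def capture_snapshots(op_name, params, store):
    """Attach snapshots to `params` following the D4 policy (sketch).

    `store` is a hypothetical stand-in for MemoryStore: a dict keyed by
    (space, addr). Passing None models Phase 1 timing-only mode.
    """
    if store is None:
        return params  # timing-only mode: every snapshot step is skipped

    def read_copy(space, addr):
        try:
            # Copy so later mutation of live storage cannot leak in.
            return copy.copy(store[(space, addr)])
        except KeyError:
            return None  # read failure -> None

    if "input_addrs" in params:
        # Math op: snapshot every input.
        params["input_snapshots"] = [
            read_copy(space, addr)
            for space, addr in zip(params["input_spaces"], params["input_addrs"])
        ]
    elif op_name == "dma_write":
        # Only HBM sources are captured; TCM is repopulated by Phase 2 replay.
        # `src_addr` is an assumed field name.
        if params["src_space"] == "hbm":
            params["snapshot"] = read_copy("hbm", params["src_addr"])
        else:
            params["snapshot"] = None
    return params
```

The TCM branch deliberately records `None` rather than reading the store, which mirrors the stale-value concern raised in the ADR-0027 postmortem.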

### D5. TileToken handling — `record_start` captures stage info

ADR-0014 D6's self-routing tile token (pipeline mode) may have already
advanced its `stage_idx` by the time `record_end` runs (the TileToken
caches the next stage's params as it moves to the next component).
Therefore:

`record_start` pre-saves the following in `pending[id(msg)]["snap"]`:

```python
snap["stage_type"] = stage.stage_type.name  # "DMA_READ", "GEMM", ...
snap["stage_params"] = dict(stage.params)  # copy of params at start time
```

`record_end` retrieves this snap and merges into params:

- Adds `params["stage_type"]` to final params.
- Merges `stage_params` keys (keeps existing values if any).
- If `op_name == "TileToken"`, rewrites it to
  `f"TileToken/{stage_type}"` (e.g., `"TileToken/DMA_READ"`),
  disambiguating different stages emitted by the same component.

Thanks to this, DMA_READ vs DMA_WRITE, FETCH vs STORE coming from the
same component (e.g., pe_dma) are distinguishable in reports.
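
The merge rules listed in D5 can be sketched as follows. The function name `merge_stage_snapshot` is hypothetical, and the sketch assumes that `snap` has exactly the two keys saved by `record_start`; it is not the actual `record_end` body.

```python
def merge_stage_snapshot(op_name, params, snap):
    """Merge the record_start snapshot into the final params (D5 sketch)."""
    merged = dict(params)
    stage_type = snap["stage_type"]
    merged["stage_type"] = stage_type
    for key, value in snap["stage_params"].items():
        # Keep a value already present in params; only fill in missing keys.
        merged.setdefault(key, value)
    if op_name == "TileToken":
        # Disambiguate stages emitted by the same component.
        op_name = f"TileToken/{stage_type}"
    return op_name, merged
```

Using `setdefault` is one straightforward way to express "keeps existing values if any": values extracted at `record_end` win over the values cached at start time.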

### D6. `MemoryStore` — two-level (space, addr) dict

```python
import numpy as np

class MemoryStore:
    def __init__(self) -> None:
        self._storage: dict[str, dict[int, np.ndarray]] = {}

    def write(self, space, addr, data):
        # Lazily create the space on first write (D6.1).
        self._storage.setdefault(space, {})[addr] = data

    def read(self, space, addr, shape=None, dtype=None) -> np.ndarray: ...
    def has(self, space, addr) -> bool: ...
    def snapshot(self) -> "MemoryStore": ...
```

#### D6.1. Space namespace

A string key. Standard values:

- `"hbm"`: HBM data (deploy_tensor + Phase 2 dma_write results).
- `"tcm"`: PE-local TCM (Phase 2 math/gemm output).
- `"sram"`: cube-level SRAM (ADR-0023 D9.7's IPCQ slot tier).

Other spaces (e.g., `"reg"`) are allowed — `_storage` is a lazy dict
that creates a new space when `write` first touches it.

#### D6.2. Address keying

`addr` is an integer. It may be a **physical address (PA) or a virtual
address (VA)** — `MemoryStore` itself doesn't know address-space
semantics; it just uses them as keys. Phase 1's `MemoryWriteMsg`
writes both PA and VA
(`_create_tensor` zero-inits at PA and at the VA base too); Phase 2
reads/writes via the addresses captured by op_log.

The caller decides `addr`'s meaning — `MemoryStore` provides only
lookup.

#### D6.3. read/write semantics — reference store (no copy)

`write(space, addr, data)`: stores the ndarray reference. **No copy.**
If the caller later mutates the same ndarray, the stored value
changes.

`read(space, addr, shape=None, dtype=None)`: returns the stored
ndarray reference. If `shape`/`dtype` are provided:

- `dtype != stored.dtype`: `arr.view(np_dtype)` reinterprets as a
  view (no copy).
- `shape != stored.shape`: if `nbytes` matches, `arr.reshape(shape)`
  is a view.
- `nbytes` mismatch → `ValueError`.

To detach the data, the caller must call `arr.copy()`. ADR-0027's
race-avoidance requires explicit `.copy()` in op_log snapshot steps
for exactly this reason.

#### D6.4. `has(space, addr) -> bool`

Existence check; does not materialize data.

#### D6.5. `snapshot() -> MemoryStore`

Shallow copy. Creates a new instance of inner dicts but shares
ndarray references. Used at Phase 2 init to fork from Phase 1's
store, so Phase 2 mutations don't affect Phase 1's remaining
consumers.
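
To make the reference-store and shallow-snapshot semantics of D6.3 and D6.5 concrete, here is a minimal stand-in. `SketchStore` is hypothetical and stores arbitrary Python objects (lists in the usage below) instead of ndarrays, so it omits the `shape`/`dtype` reinterpretation that the real `read` performs.

```python
class SketchStore:
    """Stand-in for MemoryStore; payloads are arbitrary Python objects."""

    def __init__(self):
        self._storage = {}  # space -> {addr -> payload}

    def write(self, space, addr, data):
        # Lazily create the space; keep the reference, never a copy (D6.3).
        self._storage.setdefault(space, {})[addr] = data

    def read(self, space, addr):
        return self._storage[space][addr]  # the same object that was written

    def has(self, space, addr):
        return addr in self._storage.get(space, {})

    def snapshot(self):
        # New outer and inner dicts, shared payload references (D6.5).
        fork = SketchStore()
        fork._storage = {space: dict(inner) for space, inner in self._storage.items()}
        return fork


phase1 = SketchStore()
buf = [1, 2, 3]
phase1.write("hbm", 0x1000, buf)
buf[0] = 99  # caller-side mutation is visible through the store

phase2 = phase1.snapshot()
phase2.write("tcm", 0, [7])  # new key lands only in the fork
```

Under these semantics the fork isolates writes that bind an address to a new payload, while an in-place mutation of a shared payload would still be visible from both stores. That is why an explicit copy is needed wherever a detached value is required.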

### D7. op_log assumes a single-threaded SimPy

`OpLogger`'s `_records` and `_pending` are lock-free. SimPy is
single-threaded, so nothing else can intrude between `record_start`
and `record_end` for the same message.

When multi-process kernbench (ADR-0047 D6) arrives, OpLogger must be
split per process — one OpLogger instance cannot receive records from
multiple processes.

## Alternatives Considered

### A1. Externalize op_log to SQLite / parquet

Rejected (currently). The in-memory list minimizes Phase 1 → Phase 2
hand-off latency. Externalization makes sense for long-running batch
runs but adds overhead for the current single-run workload.

### A2. Capture snapshots at `record_start`

Rejected. At `record_start`, inputs are often not yet populated (e.g.,
a math op's input is the output of a just-issued previous op).
`record_end` is the correct point.

### A3. Per-component MemoryStore

Rejected. The (space, addr) key already disambiguates effectively, and
splitting per component would complicate cross-PE IPCQ copy (ADR-0023
D9), which needs access to both source and destination stores.

### A4. Explicit dependency edges in op_log

Partially adopted. The `dependency_ids` field exists on `OpRecord` but
is currently unused (D1). Phase 2 DataExecutor orders via `t_start` +
a secondary sort (memory ops before math at the same `t_start`). When
an explicit dependency graph is required, this field is the home.
Current ordering rules are sufficient, so it remains unused.
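
The ordering rule described in A4 amounts to a two-part sort key. The sketch below uses plain dicts as stand-ins for `OpRecord`, and `replay_order` is a hypothetical name; it also assumes that every non-memory kind shares the same secondary rank, which the text above does not spell out.

```python
def replay_order(records):
    """Order records by t_start, with memory ops first on ties (sketch)."""
    kind_rank = {"memory": 0}  # every other op_kind falls back to rank 1
    return sorted(
        records,
        key=lambda rec: (rec["t_start"], kind_rank.get(rec["op_kind"], 1)),
    )
```

Because Python's `sorted` is stable, records that tie on both parts of the key keep their original log order.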

## Consequences

- ADR-0020's op_log / MemoryStore declarations are expanded into the
  concrete D1–D6 schemas, so writing/modifying Phase 2 DataExecutor
  doesn't need source-grep to learn field semantics.
- D3's per-`op_name` params matrix makes adding new ops (e.g., a new
  reduction type) a question of branching in `_extract_op_info`.
- D4's per-op snapshot policy (math = input snapshot, dma_write =
  HBM-only snapshot) is ADR-locked, so ADR-0027's race-avoidance
  decision won't silently regress on future refactors.
- D6.3's reference-store semantics are explicit, putting mutation
  safety on the caller. ADR-0027's explicit `.copy()` pattern is
  justified.
- D7's single-thread assumption is recorded, so multi-process
  kernbench (ADR-0047 D6's supersession candidate) will need OpLogger
  separation when introduced.
@@ -0,0 +1,351 @@
# ADR-0053: Topology Builder + Visualizer Algorithms

## Status

Accepted (2026-05-22).

Pins down the key algorithmic choices of the topology compile and
visualization pipeline jointly implemented by `topology/builder.py`,
`topology/mesh_gen.py`, and `topology/visualizer.py` —
placement-driven router attachment, mesh auto-layout, the source_hash
cache, view projections, and SVG rendering. ADR-0006 defines the
high-level intent of topology compilation (compiled topology, distance
extraction, automatic diagram generation), but **which algorithms the
builder actually uses** was only discoverable via source grep.

## First action

When `resolve_topology(path_str)` is called, four steps run in order:

1. **Path validation** (`builder.py::resolve_topology`):
   `Path(path_str).expanduser().resolve()`, existence check, file
   check. Failure → `FileNotFoundError` or `ValueError`.
2. **YAML parsing** (`_read_spec`): `yaml.safe_load`. Parse errors
   yield a `ValueError` with line/column. Non-dict roots are
   rejected.
3. **Auto-generate the mesh** (`mesh_gen.ensure_mesh_file`): create or
   reuse a `cube_mesh.yaml` next to the topology file. Cache hit on
   matching source_hash; miss triggers regeneration. This step decides
   the cube NoC's router grid and attachment information.
4. **Compile the graph** (`_compile_graph`): system → IO chiplets →
   cubes → inter-cube edges → IO↔cube edges → system↔IO edges, then
   build four view projections (system, sip, cube, pe) and wrap into
   a `TopologyGraph`.

In short, **topology compilation's first act is "read topology.yaml as
a dict, create/validate cube_mesh.yaml in the same directory, then
build the flat graph + 4-view projection in system → sip → cube → pe
order"**.
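
Step 1 of the sequence above (path validation) can be sketched with the standard library alone. `validate_topology_path` is a hypothetical name, and the mapping of each failure to an exception type (missing path raises `FileNotFoundError`, non-file raises `ValueError`) is an assumption; the text only states that one of the two is raised.

```python
from pathlib import Path


def validate_topology_path(path_str):
    """Resolve and validate a topology path (sketch of step 1)."""
    path = Path(path_str).expanduser().resolve()
    if not path.exists():
        # Assumed mapping: a missing path raises FileNotFoundError.
        raise FileNotFoundError(f"topology file not found: {path}")
    if not path.is_file():
        # Assumed mapping: a directory or other non-file raises ValueError.
        raise ValueError(f"topology path is not a file: {path}")
    return path
```

Resolving before checking means later steps, such as placing `cube_mesh.yaml` next to the topology file, can rely on an absolute path.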

## Context

`topology/` package responsibilities:

- **builder.py** (1207 lines): turns topology.yaml into a
  `TopologyGraph` (nodes + edges + 4 view projections).
- **mesh_gen.py** (305 lines): auto-decides the cube NoC's router
  grid and PE/UCIe/M_CPU/SRAM attachment positions and caches them in
  `cube_mesh.yaml`.
- **visualizer.py** (887 lines): generates four SVG diagrams (system /
  sip / cube / pe) from a `TopologyGraph`.

ADR-0006 makes the high-level decision that "the result of topology
compilation is the single source for distance metadata and diagram
generation", but specific algorithms (e.g., placement-driven
nearest-router attachment, the HBM exclusion zone, which fields in
source_hash trigger regeneration) are not in any ADR.

In particular, these decisions are absent at ADR level:

- Why is mesh_gen cached in a separate file (`cube_mesh.yaml`)?
- Which fields are in source_hash, and which changes force
  regeneration?
- Why placement coordinates in mm rather than cube coordinates?
- How are the HBM exclusion zone and UCIe N/S/E/W distribution
  decided inside the mesh?
- What is the abstraction-level difference among the four view
  projections (system/sip/cube/pe)?

This ADR captures these decisions in one place.

## Decision

### D1. Compile pipeline — six stages

`_compile_graph(spec)`:

1. **System nodes** (`_instantiate_system`): add system-level nodes
   like `fabric.switch0` and the host CPU.
2. **Per-SIP loop** (`for sip_id in range(system.sips.count)`):
   - **IO chiplets** (`_instantiate_io_chiplets`): create pcie_ep /
     io_cpu / io_noc / io_ucie PHYs / conn nodes and their bidirectional
     internal edges.
   - **Cube instantiation** (`_instantiate_cube`): using
     cube_mesh.yaml's router grid, instantiate cube routers, PE
     sub-components (pe_cpu, pe_dma, pe_fetch_store, pe_gemm, pe_math,
     pe_mmu, pe_tcm, pe_scheduler, pe_ipcq), m_cpu, sram, hbm_ctrl,
     and their internal edges.
   - **Inter-cube edges** (`_add_inter_cube_edges`): the UCIe
     N/S/E/W mesh edges.
   - **IO ↔ cube edges** (`_add_io_to_cube_edges`): connect io_noc to
     each cube's edge UCIe phy.
3. **Switch ↔ IO edges** (`_add_system_to_io_edges`): bidirectional
   edges between `fabric.switch0` and each SIP's `pcie_ep` (the
   cross-SIP IPCQ path of ADR-0038 D3 + ADR-0010).
4. **Build four view projections**:
   - `_build_system_view(spec)` — Tray level: SIPs and the system
     switch.
   - `_build_sip_view(spec)` — inside one SIP: cube mesh + IO
     chiplet.
   - `_build_cube_view(spec)` — inside one cube: router grid + PE /
     M_CPU / SRAM / HBM_CTRL attachments.
   - `_build_pe_view(spec)` — inside one PE: nine sub-components +
     internal edges (pe_internal kind).
5. **Return `TopologyGraph`**: `TopologyGraph(spec, nodes, edges,
   system_view, sip_view, cube_view, pe_view)`.

The six node/edge stages (system nodes, IO chiplets, cubes, inter-cube
edges, IO ↔ cube edges, switch ↔ IO edges) are **ordered for a
reason**: only after cubes exist do inter-cube edges have valid
src/dst, and IO chiplets must precede the IO ↔ cube edges that
reference them. New node types must slot in the right spot.

### D2. `cube_mesh.yaml` — a separate file with a source_hash cache

`mesh_gen.ensure_mesh_file(cube_spec, mesh_path)`:

1. Compute `source_hash = _compute_source_hash(cube_spec)` from these
   input fields:
   - `geometry` (cube_mm.w/h …).
   - `pe_layout` (corners, pe_per_corner).
   - `ucie.n_connections`.
   - `memory_map.hbm_mapping_mode`.
   - `placement` (m_cpu/sram pos_mm).
2. If `mesh_path` (= `cube_mesh.yaml` next to topology.yaml) exists
   and `existing.source_hash == source_hash`, reuse it (cache hit).
3. Otherwise, generate a new mesh via
   `_generate_mesh(cube_spec, source_hash)` and write to yaml.

The mesh is cached as a separate file because:

- Mesh generation involves nontrivial PE/UCIe/router attachment math
  and is too expensive to redo every time.
- Multiple runs with the same cube spec must guarantee an identical
  mesh.
- The resulting mesh is itself an inspectable / debuggable artifact.

The five fields listed in source_hash are the ones that determine
mesh shape; other changes (e.g., bandwidth, overhead_ns) do not
trigger mesh regeneration.
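
A minimal sketch of the cache check described in D2 follows. The hashing scheme (SHA-256 over canonical JSON of the five fields) is an assumption, as are the names `compute_source_hash` and `needs_regeneration` and the sample spec values; the actual `_compute_source_hash` may serialize and hash differently.

```python
import hashlib
import json

# The five inputs named in D2, as key paths into the cube spec.
HASHED_FIELDS = (
    ("geometry",),
    ("pe_layout",),
    ("ucie", "n_connections"),
    ("memory_map", "hbm_mapping_mode"),
    ("placement",),
)


def _dig(spec, path):
    """Follow a key path; return None if any key is missing."""
    node = spec
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def compute_source_hash(cube_spec):
    """Hash only the mesh-shaping fields (assumed scheme: SHA-256 of JSON)."""
    subset = {".".join(path): _dig(cube_spec, path) for path in HASHED_FIELDS}
    canonical = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_regeneration(cube_spec, existing_mesh):
    """True on cache miss: no mesh file yet, or a stale source_hash."""
    if existing_mesh is None:
        return True
    return existing_mesh.get("source_hash") != compute_source_hash(cube_spec)
```

Hashing a projected subset rather than the whole spec is what lets bandwidth or latency edits leave the cached mesh untouched.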

### D3. Cube NoC mesh auto-layout

`_generate_mesh(cube_spec)`:

#### D3.1. Rows / columns

- `pe_positions = _corner_pe_positions(cube_w, cube_h)`: PE-center
  coordinates (mm) per corner (NW/NE/SW/SE). Hardcoded patterns like
  `(1.5, 1.5)` and `(cube_w-1.5, cube_h-1.5)`; with `pe_per_corner=2`,
  each corner has two PE positions.
- `col_xs = _compute_col_positions(...)`: union of PE x-coordinates,
  plus relay columns inserted when any gap exceeds
  `max_spacing = 3.0 mm`.
- `row_ys, rows_per_half = _compute_row_positions(cube_h,
  n_connections, pe_positions)`:
  - `n_conn = max(n_connections, 2)` (hot-path minimum).
  - `rows_per_half = ceil(n_conn / 2)`.
  - Top half + two HBM rows + bottom half. HBM sits at
    `(cube_h/2 - 1.5, cube_h/2 + 1.5)`. The gap between PE rows and
    HBM rows is `hbm_gap = 1.5 mm`.
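
The column and row-count rules in D3.1 can be illustrated as below. The relay policy shown (split an oversized gap into equal segments no wider than `max_spacing`) is an assumption, since the text only says relay columns are inserted; the function names are hypothetical and the input is assumed to be non-empty.

```python
import math


def compute_col_positions(pe_xs, max_spacing=3.0):
    """Union of PE x-coordinates plus relay columns for wide gaps (sketch)."""
    cols = sorted(set(pe_xs))
    out = [cols[0]]
    for x in cols[1:]:
        start = out[-1]
        gap = x - start
        if gap > max_spacing:
            # Assumed policy: evenly subdivide the gap with relay columns.
            n_segments = math.ceil(gap / max_spacing)
            step = gap / n_segments
            out.extend(start + step * i for i in range(1, n_segments))
        out.append(x)
    return out


def rows_per_half(n_connections):
    """At least two connections are assumed, then split across two halves."""
    return math.ceil(max(n_connections, 2) / 2)
```

For example, PE columns at 1.5 mm and 10.5 mm are 9 mm apart, so under this policy two relay columns are added between them.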

#### D3.2. HBM exclusion zone

`hbm_row_start = rows_per_half`,
`hbm_row_end = rows_per_half + 1`.
`hbm_col_start = n_cols // 2 - 1`,
`hbm_col_end = n_cols // 2`.

Router slots inside this (row, col) rectangle are marked `None` (no
router). HBM controllers are added separately as
`hbm_ctrl.pe{X}` nodes following ADR-0017 D9's per-PE partition
pattern.

#### D3.3. PE attachment

Each corner's PEs map to a row:

- Top half: NW → row 0, NE → row 1 (top_corners index).
- Bottom half: SW → row `hbm_row_end + 1`, SE → row
  `hbm_row_end + 2`.

Each PE attaches to the router in the column nearest its x-coordinate
(`min(range(n_cols), key=lambda c: abs(col_xs[c] - pe_x))`).
Attachment items are `pe{pe_idx}.dma`, `pe{pe_idx}.cpu`,
`pe{pe_idx}.hbm` (pushed into the router's attach list).

#### D3.4. M_CPU / SRAM attachment — nearest router by Euclidean distance

For `placement.m_cpu.pos_mm` (default `[1.5, 5.5]`) and
`placement.sram.pos_mm` (default `[1.5, 8.5]`), find the router with
the smallest Euclidean distance and append `"m_cpu"` / `"sram"` to
its attach list.
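
The exclusion zone of D3.2 and the nearest-router rule of D3.4 can be combined in a short sketch. It assumes the start/end indices are inclusive bounds and that empty (`None`) slots are skipped during the nearest-router search; `build_router_grid` and `nearest_router` are hypothetical names.

```python
import math


def build_router_grid(n_rows, n_cols, rows_per_half):
    """Grid of (row, col) slots with the HBM rectangle set to None (sketch)."""
    # Assumed inclusive bounds: rows_per_half .. rows_per_half + 1,
    # and n_cols // 2 - 1 .. n_cols // 2.
    hbm_rows = range(rows_per_half, rows_per_half + 2)
    hbm_cols = range(n_cols // 2 - 1, n_cols // 2 + 1)
    return [
        [None if (r in hbm_rows and c in hbm_cols) else (r, c) for c in range(n_cols)]
        for r in range(n_rows)
    ]


def nearest_router(grid, col_xs, row_ys, pos_mm):
    """Return the (row, col) of the closest existing router to pos_mm."""
    px, py = pos_mm
    candidates = [slot for row in grid for slot in row if slot is not None]
    return min(
        candidates,
        key=lambda rc: math.hypot(col_xs[rc[1]] - px, row_ys[rc[0]] - py),
    )
```

With a 4 x 4 grid and `rows_per_half = 1`, the four centre slots are empty, so a component placed near the centre attaches to a router on the surrounding ring instead.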

#### D3.5. UCIe N/S/E/W distribution

`ucie_pe_rows = top_pe_rows + bot_pe_rows` (total
`2 * rows_per_half`).

- UCIe-E: one PE row at a time, attach `ucie_e.c{i}` to the rightmost
  column's router.
- UCIe-W: attach `ucie_w.c{i}` to the leftmost column's router (E's
  mirror).
- UCIe-N/S: split PE columns into left and right halves; attach to
  the top row's / bottom row's matching columns.

Each UCIe connection is suffixed `c{i}`, distributing
ucie_n_connections PHYs (ADR-0017 D5+).

### D4. Node naming convention — single ownership

builder.py creates nodes with the following naming convention (the
single-owner principle from ADR-0051 D5):

- `fabric.switch0` — system-level switch.
- `sip{S}.{io_id}.{pcie_ep|io_cpu|io_noc|io_ucie.{dir}|conn.{id}}` —
  IO chiplet.
- `sip{S}.cube{C}.{m_cpu|sram|hbm_ctrl.pe{X}|noc.r{R}c{C}|...}` —
  inside cube.
- `sip{S}.cube{C}.pe{P}.{pe_cpu|pe_dma|pe_fetch_store|pe_gemm|pe_math|pe_mmu|pe_tcm|pe_scheduler|pe_ipcq}` —
  PE sub-components.

Changing this convention requires updating both builder.py and
router.py's helpers (ADR-0051). Components never know the convention
directly — they only call the helpers.
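
The single-owner idea can be illustrated with a few name-building helpers. These functions are hypothetical and are not the helpers that router.py actually exposes; they only show how the patterns listed in D4 could be produced in one place.

```python
# The nine PE sub-component kinds listed in D4.
PE_KINDS = (
    "pe_cpu", "pe_dma", "pe_fetch_store", "pe_gemm", "pe_math",
    "pe_mmu", "pe_tcm", "pe_scheduler", "pe_ipcq",
)


def router_node(sip, cube, row, col):
    """Name of a cube NoC router at grid position (row, col)."""
    return f"sip{sip}.cube{cube}.noc.r{row}c{col}"


def hbm_ctrl_node(sip, cube, pe):
    """Name of the per-PE HBM controller partition."""
    return f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe}"


def pe_node(sip, cube, pe, kind):
    """Name of a PE sub-component; rejects kinds outside the convention."""
    if kind not in PE_KINDS:
        raise ValueError(f"unknown PE sub-component: {kind}")
    return f"sip{sip}.cube{cube}.pe{pe}.{kind}"
```

Callers that go through helpers like these never format a node name themselves, so a change to the convention stays local to the helper module and the builder.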

### D5. Edge `kind` classification

Every edge gets a `kind`; routing policy (ADR-0051 D2) reads it. Major
kinds:

- `"pe_internal"` — within a PE between sub-components.
- `"pe_to_router"` — PE_DMA ↔ cube NoC router.
- `"router_mesh"` — between cube NoC routers.
- `"router_to_hbm"`, `"router_to_mcpu"`, `"router_to_sram"`,
  `"sram_to_router"`, etc. — between cube-attached components.
- `"ucie_internal"`, `"ucie_conn_to_router"`,
  `"router_to_ucie_conn"`, `"ucie_conn_to_noc"`,
  `"noc_to_ucie_conn"`, `"ucie_mesh"` — UCIe-related.
- `"io_internal"` — inside IO chiplet.
- `"io_to_cube"`, `"cube_to_io"` — at the IO ↔ cube boundary.
- `"pcie"` — switch ↔ pcie_ep.
- `"command"` — control-plane edges only (e.g., M_CPU ↔ NOC; excluded
  from PE DMA paths).

Adding a new edge kind requires picking a category in router.py's
four adjacency graphs (ADR-0051 D2). If you forget, it defaults to
`_adj_all` only, which can produce unintended routes.

### D6. View projection — four abstraction levels

`TopologyGraph` keeps four view projections alongside the flat
nodes+edges:

- **system_view** (`_build_system_view`): Tray level. SIP blocks and
  `fabric.switch0`. PCIe links shown. For external high-level
  overview.
- **sip_view** (`_build_sip_view`): inside one SIP — cube mesh + IO
  chiplet (pcie_ep + io_cpu + io_noc). UCIe N/S/E/W appear as
  cube-cube links.
- **cube_view** (`_build_cube_view`): inside one cube — router grid +
  PE / M_CPU / SRAM / HBM_CTRL attachments + UCIe PHY edges. For
  intra-cube routing / placement debugging.
- **pe_view** (`_build_pe_view`): inside one PE — nine sub-components
  + internal edges (pe_internal kind). For detailed PE-internal
  dataflow review.

Views are selectively rendered via the spec's
`visualization.emit_views: [system, sip, cube]` (ADR-0006). The pe
view is omitted from default output but the code is retained for
detailed debugging.

### D7. visualizer.py — SVG diagram output

`emit_diagrams(graph, out_dir)` renders every view as SVG. Key
functions:

- `_render_view_svg(view)` — generic view render (no router grid).
- `_render_cube_view_svg(view, spec)` — cube-view specific (HBM block,
  router grid layout, PE/M_CPU/SRAM/HBM placement).
- `_draw_node`, `_draw_edge` — node/edge visual representation.
- `_pick_scale`, `_compute_node_sizes` — auto-scaling.

The visualizer is a **derived artifact** (ADR-0006); changes here are
not subject to production checks. This aligns with CLAUDE.md's "Derived
Artifacts" guidance.

### D8. Blast radius of spec changes

| spec field                             | effect                  | mesh regenerated? |
|----------------------------------------|-------------------------|-------------------|
| `system.sips.count`                    | SIP count, node count   | No                |
| `sip.cube_mesh.w/h`                    | cube mesh shape         | No                |
| `cube.geometry.cube_mm.w/h`            | cube size (mm)          | **Yes**           |
| `cube.pe_layout.corners/pe_per_corner` | PE attachment positions | **Yes**           |
| `cube.ucie.n_connections`              | UCIe PHY distribution   | **Yes**           |
| `cube.memory_map.hbm_mapping_mode`     | HBM distribution mode   | **Yes**           |
| `cube.placement`                       | M_CPU/SRAM positions    | **Yes**           |
| `cube.memory_map.*` (besides above)    | HBM capacity / BW       | No                |
| `*.links.*.bw_gbs`                     | edge bandwidth          | No                |
| `*.attrs.overhead_ns`                  | component latency       | No                |

The table mirrors D2's `_compute_source_hash` inputs. Changes that
require mesh regeneration automatically invalidate `cube_mesh.yaml`'s
source_hash.

## Alternatives Considered

### A1. Regenerate the mesh on every compile without a cache file

Rejected. The cost of mesh generation would be paid repeatedly (CLI
runs, probe, tests) for the same spec, and the human-inspectable
artifact would disappear.

### A2. Merge mesh generation into builder.py

Rejected (currently). It is a 305-line algorithm of its own, and the
mesh-layout decisions (placement-driven router attachment, HBM
exclusion zone) are different from builder's general node/edge
emission. Keeping it separate respects single-responsibility.

### A3. Express placement coordinates in cube coordinates (col/row)

Rejected. mm coordinates flow consistently between the visualizer and
mesh layout (for nearest-router computation). Cube coordinates are
undefined until the router grid is fixed, so they are unsuitable as
placement input.

### A4. Lazy view projection generation

Rejected (currently). The four views are cheap to build (typically <
100 ms), and eager construction guarantees `TopologyGraph` as the
single source of truth.

### A5. Visualizer output in formats besides SVG (PNG/PDF)

Rejected. SVG is vector + text-searchable + directly renderable in
browsers. PNG conversion, when required, is downstream
post-processing (e.g., rsvg-convert).

## Consequences

- ADR-0006's high-level intent is fleshed out via D1–D7; topology
  changes can be assessed quickly via D8's table.
- D3's mesh-layout algorithm is ADR-locked, so future PE attachment
  patterns (e.g., a 6-zone HBM split) make clear which stage they
  affect.
- D5's edge-kind list and D6's view structure are explicit, giving PR
  reviewers a quick map of where (builder + router + visualizer) a
  new component type ripples through.
- D2's source_hash invalidation rules are explicit, so a stale
  `cube_mesh.yaml` (e.g., when only bandwidth changed) is recognized
  as correct behavior.
@@ -0,0 +1,143 @@
# ADR-0054: Milestone Eval Benches — self-contained sweep + figure benches

## Status

Accepted (2026-05-22).

Amends ADR-0044 (D1/D2) and ADR-0045 (D5) and supersedes the "logic lives
in `scripts/` + `tests/`" arrangement of ADR-0043/0044: the GEMM and
allreduce evaluation harnesses are now self-contained **benches** that a
user runs to regenerate every result + figure.

## Context

ADR-0043 (allreduce eval) and ADR-0044 (GEMM eval) split each harness into
a **sweep** (a manual `scripts/` driver, or — for allreduce — the
parametrized tests themselves) plus **figure tests** that render committed
data. The sweep/render logic therefore lived under `scripts/gemm_sweep.py`,
`tests/gemm/_gemm_plot_helpers.py`, and `tests/sccl/_allreduce_helpers.py`.

A milestone requirement ("refactor allreduce + GEMM evaluation so a user
can run *one bench* to generate all the results and plots") cannot be met
by that layout: a bench is production code and **must not import from
`tests/`** (ADR-0007 layer direction). The eval logic had to move into
production, reachable from a bench.

The chosen home is the bench module itself — not a separate
`kernbench.eval` package. A bench file may contain arbitrary module-level
code; collapsing the harness into the bench keeps one file per domain and
avoids an extra package layer.

## Decision

### D1. Two milestone benches own the eval logic

- `src/kernbench/benches/milestone_1h_gemm.py` — GEMM shape×variant sweep +
  the three figure renderers (moved from `scripts/gemm_sweep.py` +
  `tests/gemm/_gemm_plot_helpers.py`).
- `src/kernbench/benches/milestone_1h_ccl.py` — the distributed allreduce
  driver, latency + buffer-kind sweeps, topology diagram, FSIM comparison,
  and the direct-launch parity reference (moved from
  `tests/sccl/_allreduce_helpers.py`).

Each file is the **single home** for its domain's eval logic.

### D2. The "eval bench" pattern (extends ADR-0045 D5)

ADR-0045 D5 fixed a bench to a single configuration (single-SIP, or the
ADR-0024 multi-SIP CCL exception). This ADR adds a third pattern:

- An **eval bench** may drive *many* configurations and render figures. It
  builds its own per-config `GraphEngine` / `RuntimeContext` instances
  (one per sweep point) rather than using the outer `run_bench` engine.
- Because the outer ctx then has no submitted handles, the bench submits a
  **sentinel tensor** (`torch.zeros((1, 1), …)`) at the end to satisfy
  `run_bench`'s "must submit at least one request" contract (ADR-0045 D4),
  so the CLI exits 0.

### D3. Output location

Both benches write to `src/kernbench/benches/1H_milestone_output/{gemm,ccl}/`
(per user request — artifacts beside the bench). The directory holds only
generated PNG/CSV/JSON (never a `.py`/`__init__.py`), so the eager-import
audit (ADR-0045 first action) ignores it — `pkgutil.iter_modules` does not
yield non-package subdirectories. It is **committed** (like the
`docs/diagrams/` artifacts) so the figures are viewable on the remote;
rerunning the bench regenerates it in place.
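
The claim in D3 about `pkgutil.iter_modules` can be checked in isolation. The sketch below builds a throwaway directory that imitates the layout (one bench module plus an output directory without `__init__.py`); the file name `sweep.json` and the helper `discover_modules` are made up for the illustration.

```python
import pkgutil
import tempfile
from pathlib import Path


def discover_modules(package_dir):
    """Names that pkgutil.iter_modules reports for one directory."""
    return sorted(info.name for info in pkgutil.iter_modules([str(package_dir)]))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # A bench module, as in src/kernbench/benches/.
    (root / "milestone_1h_gemm.py").write_text("")
    # An artifact directory holding only generated data, no __init__.py.
    out_dir = root / "1H_milestone_output" / "gemm"
    out_dir.mkdir(parents=True)
    (out_dir / "sweep.json").write_text("{}")
    found = discover_modules(root)

print(found)
```

Only the module is reported; the artifact directory is skipped because it contains no `__init__.py`.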

### D4. GEMM heavy sweep — fresh by default, `MILESTONE_FAST` to reuse

`milestone-1h-gemm` runs the full 24-sim sweep by default (minutes; one
shape is 2048 tiles). `MILESTONE_FAST=1` reuses the committed
`docs/diagrams/gemm_sweep.json` and only re-renders (seconds). This
reverses ADR-0044 D1/D2's "heavy sweep stays a manual/`slow`-marked step":
running the bench *is* the regeneration. In the test suite, the slow
(full-sweep) path is exercised by a `@pytest.mark.slow` bench test, while
the fast path is the one tested by default.
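
The fresh-versus-reuse switch can be sketched as a small function. Everything here is illustrative: `load_or_run_sweep` and its parameters are hypothetical, and the exact comparison against the string `"1"` is an assumption about how `MILESTONE_FAST` is parsed.

```python
import json
import os


def load_or_run_sweep(json_path, run_sweep, environ=None):
    """Return sweep results, reusing committed JSON when MILESTONE_FAST=1."""
    env = os.environ if environ is None else environ
    if env.get("MILESTONE_FAST") == "1":
        # Fast path: reuse the committed results and only re-render.
        with open(json_path, encoding="utf-8") as fh:
            return json.load(fh)
    # Default path: run the full sweep again.
    return run_sweep()
```

Passing the environment as a parameter keeps the sketch testable without mutating `os.environ`.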

### D5. Tests + script reuse via thin re-export shims (single home kept)

The pre-existing figure tests and the `scripts/gemm_sweep.py` entry point
are retained and now reuse the bench modules:

- `tests/gemm/_gemm_plot_helpers.py` → re-exports the renderers +
  `GEMM_SWEEP_JSON`/`GEMM_PLOTS_DIR`/`ROOT` from
  `kernbench.benches.milestone_1h_gemm`.
- `tests/sccl/_allreduce_helpers.py` → re-exports the driver core, config
  writers, sweep constants, renderers, and disk aggregators from
  `kernbench.benches.milestone_1h_ccl`, and keeps the **pytest-only** pieces
  local: the `pytest.param` matrices (`CONFIGS` / `_sweep_params` /
  `_bk_params`) and the fixture-coupled `_run_distributed`
  (`monkeypatch.chdir` + `_drive_distributed`) wrapper.
- `scripts/gemm_sweep.py` → thin wrapper over the bench's `run_sweep`.

Tests importing a bench module is permitted (tests sit above production,
ADR-0007); it triggers the whole-package eager audit, which already runs on
every `kernbench` invocation. matplotlib stays lazily imported inside the
renderers, so the audit's startup cost is unchanged.

### D6. Flat module naming (no `benches/` subfolder)

A `benches/` subpackage named `1H_milestone…` is impossible — a Python
package name cannot start with a digit. The benches are therefore flat
modules `milestone_1h_gemm.py` / `milestone_1h_ccl.py` with bench names
`milestone-1h-gemm` / `milestone-1h-ccl` (kebab-case, letter-first per
ADR-0045 D1).
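
The naming constraint in D6 follows from Python's identifier rules and can be checked directly. `is_importable_name` is a hypothetical helper, and deriving the module name from the bench name by replacing hyphens with underscores is an assumption about how the two names relate.

```python
import keyword


def is_importable_name(name):
    """True if `name` can appear in an `import` statement."""
    return name.isidentifier() and not keyword.iskeyword(name)


digit_first = is_importable_name("1H_milestone")        # starts with a digit
letter_first = is_importable_name("milestone_1h_gemm")  # letter-first
# Assumed mapping from kebab-case bench name to module name.
module_name = "milestone-1h-gemm".replace("-", "_")
```

A directory whose name starts with a digit can still hold data files, which is why the output directory `1H_milestone_output` is fine as long as nothing tries to import it.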

## Consequences

### Positive

- `kernbench run --bench milestone-1h-gemm` (or `…-ccl`) regenerates all of
  a domain's results + figures in one command — the milestone requirement.
- Single source for the eval logic (the bench), reused by tests and the
  script via shims; no duplication.
- The figure tests and `scripts/gemm_sweep.py` keep working unchanged.

### Negative / limitations

- The two bench files are large (the CCL one mixes the distributed driver,
  sweeps, and matplotlib drawing). A "bench" that is mostly an eval harness
  is unusual; this ADR legitimizes it.
- Generated artifacts live inside the source tree (`src/kernbench/benches/`)
  by explicit request and are committed (so the figures are viewable on the
  remote); rerunning the bench regenerates them.
- `milestone-1h-ccl` (and the default `milestone-1h-gemm`) take minutes —
  acceptable for an on-demand milestone artifact, not for routine runs.

## Dependencies

- **ADR-0007**: layer direction (why tests may import production but a bench
  may not import tests).
- **ADR-0043 / ADR-0044**: the allreduce / GEMM eval harnesses this ADR
  relocates into benches.
- **ADR-0045**: bench module contract; D2 here extends its D5 (single-device
  rule) with the eval-bench pattern, and relies on D4 (NO_REQUESTS) for the
  sentinel.
- **ADR-0024**: rank = SIP launcher driven by the allreduce sweeps.

## Open questions

- Should the GEMM theoretical-model constants (ADR-0044 D5) be sourced from
  ADR-0033/0014 rather than copied? Unchanged by this ADR.
- Should `build_overview_slides.py` consume the milestone output PNGs
  instead of drawing GEMM bars natively? Still open (ADR-0044 D6 / Negative).
@@ -0,0 +1,175 @@
# ADR Index

Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **47**.

Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.

## Design Principles

- [ADR-0013](./ADR-0013-ver-verification-strategy.md) — Verification Strategy and Phase 1 Test Plan
- [ADR-0033](./ADR-0033-lat-latency-model-assumptions.md) — Latency Model: Assumptions and Known Simplifications

## High-level Architecture

- [ADR-0003](./ADR-0003-dev-target-system-hierarchy.md) — Target System Hierarchy & Modeling Scope _(System hierarchy (Tray / SIP / CUBE / PE))_
- [ADR-0007](./ADR-0007-api-runtime-api-boundaries.md) — Runtime API and Simulation Engine Boundaries _(Runtime API ↔ sim_engine boundaries)_
- [ADR-0016](./ADR-0016-dev-iochiplet-noc-and-memory-path.md) — IOChiplet NOC and Memory Data Path _(IOChiplet NOC and memory data path)_
- [ADR-0017](./ADR-0017-dev-cube-noc-and-hbm-connectivity.md) — Cube NOC and HBM Connectivity _(Cube NOC and HBM connectivity)_

## Detailed Architecture

One subsection per component file under `src/kernbench/components/builtin/`.

### forwarding

- [ADR-0037](./ADR-0037-dev-forwarding-component.md) — Forwarding Component (forwarding_v1)
||||
|
||||
### hbm_ctrl
|
||||
|
||||
- [ADR-0034](./ADR-0034-dev-hbm-controller-internal-design.md) — HBM Controller Internal Design
|
||||
|
||||
### io_cpu
|
||||
|
||||
- [ADR-0036](./ADR-0036-dev-io-cpu-component-model.md) — IO_CPU Component Model
|
||||
|
||||
### m_cpu
|
||||
|
||||
- [ADR-0035](./ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md) — M_CPU and M_CPU.DMA Component Model
|
||||
|
||||
### pcie_ep
|
||||
|
||||
- [ADR-0038](./ADR-0038-dev-pcie-ep-component-model.md) — PCIE_EP Component Model
|
||||
|
||||
### pe_cpu
|
||||
|
||||
- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
|
||||
|
||||
### pe_dma
|
||||
|
||||
- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
|
||||
- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
|
||||
|
||||
### pe_fetch_store
|
||||
|
||||
- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
|
||||
|
||||
### pe_gemm
|
||||
|
||||
- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
|
||||
|
||||
### pe_ipcq
|
||||
|
||||
- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
|
||||
|
||||
### pe_math
|
||||
|
||||
- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
|
||||
|
||||
### pe_mmu
|
||||
|
||||
- [ADR-0039](./ADR-0039-dev-pe-mmu-component-model.md) — PE_MMU Component Model — Component + Utility Dual Role
|
||||
|
||||
### pe_scheduler
|
||||
|
||||
- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
|
||||
|
||||
### pe_tcm
|
||||
|
||||
- [ADR-0040](./ADR-0040-dev-pe-tcm-component-model.md) — PE_TCM Component Model — Dual-Channel BW Serialization
|
||||
|
||||
### sram
|
||||
|
||||
- [ADR-0041](./ADR-0041-dev-cube-sram-component-model.md) — Cube SRAM Component Model — terminal scratchpad on cube NoC
|
||||
|
||||
### tiling
|
||||
|
||||
- [ADR-0042](./ADR-0042-prog-tile-plan-generators.md) — Tile Plan Generators — GEMM/Math Pipeline Plan Builders
|
||||
|
||||
## Implementation Decisions
|
||||
|
||||
### Address Scheme
|
||||
|
||||
- [ADR-0001](./ADR-0001-mem-physaddr-layout.md) — 51-bit Physical Address Layout & Decoding Contract
|
||||
- [ADR-0011](./ADR-0011-mem-memory-addressing-simplification.md) — Memory Addressing — PA / VA / LA Address Models
|
||||
|
||||
### Routing & Helper API
|
||||
|
||||
- [ADR-0002](./ADR-0002-lat-routing-distance.md) — Routing Distance, Ordering & Bypass Rules
|
||||
- [ADR-0051](./ADR-0051-lat-routing-helper-api.md) — Routing Helper API — `AddressResolver` + `PathRouter`
|
||||
|
||||
### Memory Semantics & Local-HBM Bandwidth
|
||||
|
||||
- [ADR-0004](./ADR-0004-mem-memory-semantics-local-hbm.md) — Memory Semantics & Local-HBM Bandwidth Guarantee
|
||||
|
||||
### Topology Compilation, Diagrams & Builder Algorithms
|
||||
|
||||
- [ADR-0005](./ADR-0005-dev-diagram-views-distance-layout.md) — Diagram Views & Distance-Aware Layout Rules
|
||||
- [ADR-0006](./ADR-0006-dev-topology-compilation-distance-diagram.md) — Topology Compilation, Distance Extraction, and Automatic Diagram Generation
|
||||
- [ADR-0053](./ADR-0053-dev-topology-builder-algorithms.md) — Topology Builder + Visualizer Algorithms
|
||||
|
||||
### Tensor Deployment and Allocation
|
||||
|
||||
- [ADR-0008](./ADR-0008-api-tensor-deploy-and-allocation.md) — Tensor Deployment and Allocation (Host Allocator, PA-first)
|
||||
|
||||
### Kernel Execution and Host-Device Messaging
|
||||
|
||||
- [ADR-0009](./ADR-0009-api-kernel-execution-messaging.md) — Kernel Execution Messaging and Completion Semantics
|
||||
- [ADR-0012](./ADR-0012-api-host-io-message-schema.md) — Host ↔ IO_CPU Message Schema (PA-first, PE-tagged)
|
||||
|
||||
### CLI Surface and Semantics
|
||||
|
||||
- [ADR-0010](./ADR-0010-api-cli-surface-and-semantics.md) — Command Line Interface and Execution Semantics
|
||||
|
||||
### Component Port/Wire Fabric Model
|
||||
|
||||
- [ADR-0015](./ADR-0015-dev-component-port-wire-model.md) — Component Port/Wire Model and Fabric Routing
|
||||
|
||||
### Two-Pass Data Execution
|
||||
|
||||
- [ADR-0020](./ADR-0020-prog-data-execution-two-pass.md) — 2-Pass Data Execution Model (Timing / Data Separation)
|
||||
|
||||
### 2D Grid Program Identity
|
||||
|
||||
- [ADR-0022](./ADR-0022-prog-program-id-2d-grid.md) — 2D Grid program_id Semantics
|
||||
|
||||
### Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)
|
||||
|
||||
- [ADR-0024](./ADR-0024-par-sip-tp-launcher.md) — SIP-level Launcher — rank = SIP
|
||||
- [ADR-0026](./ADR-0026-par-dppolicy-intra-device.md) — DPPolicy = Intra-Device Only — remove sip/num_sips fields
|
||||
- [ADR-0027](./ADR-0027-par-megatron-tp.md) — Megatron-style Tensor Parallelism API
|
||||
- [ADR-0047](./ADR-0047-par-ahbm-ccl-backend.md) — AHBM CCL Backend — `torch.distributed`-compat shim
|
||||
- [ADR-0050](./ADR-0050-par-ccl-algorithm-module-contract.md) — CCL Algorithm Module Contract — `ccl/algorithms/*.py`
|
||||
|
||||
### IPCQ Direction Addressing
|
||||
|
||||
- [ADR-0025](./ADR-0025-algo-ipcq-direction-addressing.md) — IPCQ Direction Addressing — address-based matching
|
||||
|
||||
### Intercube All-Reduce
|
||||
|
||||
- [ADR-0032](./ADR-0032-algo-intercube-allreduce.md) — Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
|
||||
|
||||
### Evaluation Harnesses
|
||||
|
||||
- [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce Evaluation Harness — `tests/sccl/`
|
||||
- [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM Evaluation Harness — `scripts/gemm_sweep.py` + `tests/gemm/`
|
||||
- [ADR-0054](./ADR-0054-eval-milestone-benches.md) — Milestone Eval Benches — self-contained sweep + figure benches
|
||||
|
||||
### Bench Module Contract
|
||||
|
||||
- [ADR-0045](./ADR-0045-prog-bench-module-contract.md) — Bench Module Contract — registration, dispatch, and authoring
|
||||
|
||||
### Kernel-side tl.* API (TLContext)
|
||||
|
||||
- [ADR-0046](./ADR-0046-prog-tl-context-contract.md) — TLContext — Kernel-side `tl.*` API Contract
|
||||
|
||||
### Memory Allocator Algorithms
|
||||
|
||||
- [ADR-0048](./ADR-0048-mem-allocator-algorithms.md) — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
|
||||
|
||||
### Probe Subcommand
|
||||
|
||||
- [ADR-0049](./ADR-0049-ver-probe-subcommand.md) — `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
|
||||
|
||||
### Sim-engine Op Log and Memory Store Schemas
|
||||
|
||||
- [ADR-0052](./ADR-0052-dev-oplog-memory-store-schemas.md) — OpLog + MemoryStore Schemas — sim_engine internals
|
||||
@@ -0,0 +1,836 @@
|
||||
# KernBench — Architecture Design Document
|
||||
*2026 1H*
|
||||
|
||||
KernBench is a system-level, discrete-event simulator for AI-accelerator
|
||||
chiplet systems. It models the data-movement and control paths across
|
||||
the full hardware hierarchy and reports end-to-end execution latency
|
||||
for kernels dispatched to the device's compute units.
|
||||
|
||||
This document is a public summary of the architecture as designed and
|
||||
implemented in the first half of 2026. It assumes no prior knowledge of
|
||||
the simulator's internal documents; terms specific to the system are
|
||||
defined on first use.
|
||||
|
||||
---
|
||||
|
||||
## Design Principles
|
||||
|
||||
KernBench is grounded in two foundational commitments: every measured
|
||||
latency must trace to explicit, modeled events on the simulator's graph,
|
||||
and every behavioral claim must be verifiable through tests that target
|
||||
spec-level invariants rather than incidental implementation details.
|
||||
|
||||
<!-- src: ADR-0013 Context, Decision -->
|
||||
Development is verification-driven. Tests are written to
validate the architectural contracts that the simulator exposes
(correct routing, deterministic results, monotonic latency under
increasing hop counts) rather than to mirror the call graph of the
implementation. Execution has two phases: a fast timing phase that
exercises the simulator's discrete-event engine and produces a log of
operations with timestamps, and an optional data-replay phase that uses
that log to compute real numerical results. Tests can target either phase.
|
||||
<!-- src: ADR-0033 Context, Decision -->
|
||||
The latency model is intentionally abstract rather than
cycle-accurate. Each modeled node contributes a configurable per-node
overhead, each link contributes wire delay plus byte-over-bandwidth
serialization, and each terminal service contributes its own service
time. The simulator does not attempt to reproduce cache coherence
protocols, microarchitectural pipelines, or full PCIe/UCIe protocol
correctness; those are explicitly out of scope. The aim is a
simulator that compares system-level configurations meaningfully and
deterministically, not one that claims microarchitectural accuracy.
|
||||
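As a rough illustration of the additive model described above, the sketch
below sums the three contributions along one path. The `Link` type, the
function name, the units (nanoseconds, bytes per nanosecond), and the
choice to charge serialization on every link are assumptions made for this
example; they are not the simulator's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    wire_delay_ns: float
    bytes_per_ns: float  # link bandwidth (1 GB/s is 1 byte/ns)


def path_latency_ns(node_overheads_ns, links, terminal_service_ns, nbytes):
    """Sum per-node overheads, per-link delays, and the terminal service time."""
    node_part = sum(node_overheads_ns)
    # Each link pays its wire delay plus byte-over-bandwidth serialization.
    link_part = sum(l.wire_delay_ns + nbytes / l.bytes_per_ns for l in links)
    return node_part + link_part + terminal_service_ns


# Three nodes at 2 ns each, two 64 B/ns links with 1 ns wire delay,
# a 10 ns terminal service, and a 4096-byte payload.
total = path_latency_ns([2.0, 2.0, 2.0], [Link(1.0, 64.0)] * 2, 10.0, 4096)
print(total)  # 146.0
```

Every term comes from an explicit parameter, which is the property the
paragraph above asks for: no latency appears that cannot be traced to a
node, a link, or a terminal.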
<!-- src: ADR-0033 Decision, Consequences -->
|
||||
Determinism is a hard requirement. Given identical inputs (topology,
routing policy, and request stream), the simulator must produce
identical outputs, hop traces included. This rules out reliance on
unordered set iteration on the critical path and forces every latency
contribution to come from an explicitly scheduled event on a modeled
component or link. There are no implicit waits, no hardcoded magic
delays, and no shortcuts that bypass the modeled graph.
|
||||
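One common way to meet this requirement in a discrete-event engine is to
break timestamp ties with an insertion counter and to sort any set before
iterating over it. The sketch below shows that technique with the standard
library; `EventQueue` and the event labels are hypothetical and do not
describe the engine's real scheduler.

```python
import heapq
from itertools import count


class EventQueue:
    """Min-heap keyed on (time, insertion order) so ties pop in a fixed order."""

    def __init__(self):
        self._heap = []
        self._seq = count()

    def schedule(self, time_ns, label):
        heapq.heappush(self._heap, (time_ns, next(self._seq), label))

    def drain(self):
        order = []
        while self._heap:
            time_ns, _, label = heapq.heappop(self._heap)
            order.append((time_ns, label))
        return order


def run_once():
    q = EventQueue()
    # Sort the set first: raw set iteration order is not a stable contract.
    for label in sorted({"pe1.dma", "pe0.dma", "hbm.read"}):
        q.schedule(5, label)
    q.schedule(1, "host.write")
    return q.drain()


assert run_once() == run_once()
print(run_once())
# [(1, 'host.write'), (5, 'hbm.read'), (5, 'pe0.dma'), (5, 'pe1.dma')]
```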
---
|
||||
|
||||
## High-level Architecture
|
||||
|
||||
<!-- src: ADR-0003 Context, Decision -->
|
||||
The simulated system is a four-level hierarchy. A **Tray** holds one or
|
||||
more **SIPs** (system-in-package), each containing a 2D mesh of
|
||||
**CUBEs** plus one or more **IO chiplets** that connect the SIP to the
|
||||
host. Each CUBE contains a regular grid of **PEs** (processing
|
||||
elements) plus its own attached resources — high-bandwidth memory
|
||||
(HBM), a per-cube SRAM scratchpad, and a management CPU (M_CPU). The PE
|
||||
itself is a composite of nine sub-components rather than a monolithic
|
||||
core. This hierarchy is fixed; the parameters along each axis (counts,
|
||||
mesh dimensions, link widths) are configurable through the topology
|
||||
spec.
|
||||
|
||||
<!-- src: ADR-0007 Context, Decision -->
|
||||
A clean separation runs along the request flow. A **runtime API** at
|
||||
the top is the host-facing surface; it exposes tensor and kernel
|
||||
operations, owns host-side allocation metadata, and is topology-
|
||||
agnostic — it does not route or fan out. Below it the **simulation
|
||||
engine** decomposes runtime operations into discrete graph requests
|
||||
(memory writes, memory reads, kernel launches, MMU map installs) and
|
||||
schedules events deterministically. At the bottom, **components** model
|
||||
device behavior on a graph of nodes connected by links; they
|
||||
implement the actual latency contributions and pass requests along.
|
||||
No component reaches up into the runtime API, and no runtime call
|
||||
shortcuts the engine.
|
||||
|
||||
<!-- DIAGRAM: Four-level system hierarchy — Tray containing SIPs, each SIP showing its 2D cube mesh and IO chiplet; one cube blown up to show the router mesh, attached PEs, M_CPU, SRAM, and HBM partition. -->
|
||||
|
||||
### Tray
|
||||
|
||||
<!-- src: ADR-0003 Decision -->
|
||||
The Tray is the outermost boundary. It owns the host CPU on one side
|
||||
and one or more SIPs on the other, connected through a fabric switch.
|
||||
For collective communication that must traverse multiple SIPs, the
|
||||
fabric switch acts as the common rendezvous: device-side outbound
|
||||
traffic from one SIP routes through the switch and back into the
|
||||
target SIP's IO chiplet.
|
||||
|
||||
### SIP
|
||||
|
||||
<!-- src: ADR-0003 Decision, ADR-0017 Context -->
|
||||
A SIP packages a 2D mesh of CUBEs and one or more IO chiplets. The
|
||||
default topology used by the simulator is a 4×4 cube mesh; the
|
||||
mesh dimensions are configurable. Each cube on the boundary of the
|
||||
mesh connects to its neighbors over UCIe (die-to-die) links arranged
|
||||
on the four cardinal sides — north, south, east, and west. The IO
|
||||
chiplets sit on one side of the SIP and provide the bridge to the host
|
||||
across PCIe.
|
||||
|
||||
<!-- src: ADR-0016 Context, Decision -->
|
||||
The IO chiplet contains its own internal network. A
host-facing PCIe endpoint passes traffic to a small NOC ("network on
chip"); from there it can branch to a control-plane CPU that processes
kernel-launch messages, or it can take the direct memory data path to
the cube's HBM controller. The direct memory path bypasses the control
CPU by design, so that host-issued memory writes do not pay
control-plane overhead on the data path.
|
||||
### CUBE
|
||||
|
||||
<!-- src: ADR-0017 Decision -->
|
||||
Each CUBE owns a 2D mesh of NOC routers and a set of attached
resources: PEs, the cube-local SRAM scratchpad, the management CPU
(M_CPU), and the HBM partition (split across multiple PE-private
slices for bandwidth). The router mesh uses deterministic XY routing.
Attached components do not connect to each other directly; they all
sit on the router mesh, and every cube-internal transfer pays the
mesh distance from source to destination.
|
||||
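Deterministic XY routing, as named above, resolves the X coordinate fully
before moving in Y, so a given source and destination always yield the same
router sequence. The following is a minimal sketch of that rule on integer
mesh coordinates; the function name and the `(x, y)` tuple representation
are assumptions for illustration.

```python
def xy_route(src, dst):
    """Return the router coordinates visited from src to dst, X first, then Y."""
    x, y = src
    path = [(x, y)]
    step = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step
        path.append((x, y))
    step = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step
        path.append((x, y))
    return path


route = xy_route((0, 0), (2, 1))
print(route)           # [(0, 0), (1, 0), (2, 0), (2, 1)]
print(len(route) - 1)  # 3 hops, the Manhattan distance
```

The hop count equals the Manhattan distance between the two routers, which
is the "mesh distance" every cube-internal transfer pays.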
<!-- src: ADR-0017 Decision -->
|
||||
The HBM partition is per-PE: each PE owns one HBM slice, and the
|
||||
controller exposes per-PE channels so that the same PE always
|
||||
addresses the same set of HBM channels. This makes the local-HBM
|
||||
bandwidth from a PE to its own slice predictable, while accesses to
|
||||
another PE's slice — or a different cube's slice — pay the mesh
|
||||
distance and any UCIe crossings.
|
||||
|
||||
### PE
|
||||
|
||||
<!-- src: ADR-0014 Context, Decision -->
|
||||
A PE is not a monolithic core. Internally it is a set of nine
|
||||
sub-components, each modeling one stage of a request's flow: a small
|
||||
control CPU, a tile-pipeline scheduler, a DMA engine, a fetch-store
|
||||
engine that moves data between the on-PE scratchpad and the register
|
||||
file, a GEMM compute engine, a math compute engine, the tightly-
|
||||
coupled memory (TCM, the on-PE scratchpad), an MMU for virtual-to-
|
||||
physical address translation, and an inter-PE collective queue
|
||||
(IPCQ). The scheduler decomposes higher-level operations into per-tile
|
||||
stage sequences, and tile tokens self-route from one sub-component
|
||||
to the next.
|
||||
|
||||
<!-- DIAGRAM: PE internal layout — the nine sub-components and the edges that connect them; tile token flowing through DMA_READ → FETCH → GEMM → STORE → DMA_WRITE. -->
|
||||
|
||||
---
|
||||
|
||||
## Detailed Architecture
|
||||
|
||||
This section describes each modeled device-side component in turn.
|
||||
Components are listed in the alphabetical order used by the
|
||||
simulator's source tree.
|
||||
|
||||
### forwarding
|
||||
|
||||
<!-- src: ADR-0037 Context, Decision -->
|
||||
The forwarding component is the generic routing relay used wherever a
node only needs to apply a small processing overhead and pass the
request to the next hop. NOC routers, conn nodes, and UCIe PHYs all
reduce to this. On receiving a request it first applies the
per-node overhead configured for it in the topology spec, then
hands the request to the next hop along the path.
|
||||
<!-- src: ADR-0037 Decision, Consequences -->
|
||||
The decision to share one implementation across these roles was made
|
||||
to keep the simulator's component set small without sacrificing
|
||||
modeling fidelity. Each instance still carries its own overhead and
|
||||
its own link bandwidth contributions, so different roles still produce
|
||||
different timing. What is shared is the dispatcher loop, not the
|
||||
parameter values.
|
||||
|
||||
### hbm_ctrl
|
||||
|
||||
<!-- src: ADR-0034 Context, Decision -->
|
||||
The HBM controller is the terminal node for all memory traffic that
|
||||
reaches HBM. Internally it owns a number of pseudo channels, partitioned
|
||||
per-PE so that each PE addresses a deterministic subset. On a request
|
||||
arrival the controller first selects the right pseudo channel from the
|
||||
target address, then enters a chunk-loop that drains the requested
|
||||
size in fixed-size flits over the channel's bandwidth.
|
||||
|
||||
<!-- src: ADR-0034 Decision, Consequences -->
|
||||
The chunk-loop pattern replaces an earlier all-at-once drain. The
controller no longer presents a flit-aware fabric with a single bulk
transfer; instead it emits flits at a paced rate matching the channel
bandwidth, which makes cross-flow contention visible. The bandwidth
budget is calibrated against the configured HBM total bandwidth
divided across the channel count.
|
||||
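The chunk-loop can be pictured as a generator that emits one flit per
iteration and advances time by that flit's byte-over-bandwidth cost. This
is a sketch under stated assumptions: the function name, the flit size, the
bandwidth figures, and the even split of total bandwidth across channels
are illustrative values, not the controller's configuration.

```python
def drain_in_flits(nbytes, flit_bytes, channel_bytes_per_ns, start_ns=0.0):
    """Yield (completion_time_ns, flit_size) for each paced flit."""
    t = start_ns
    remaining = nbytes
    while remaining > 0:
        size = min(flit_bytes, remaining)
        t += size / channel_bytes_per_ns  # pace each flit at channel bandwidth
        remaining -= size
        yield t, size


# Assumed figures: 512 B/ns total HBM bandwidth split evenly over 8 channels.
per_channel = 512.0 / 8
print(list(drain_in_flits(200, 64, per_channel)))
# [(1.0, 64), (2.0, 64), (3.0, 64), (3.125, 8)]
```

Because flits complete at separate timestamps, another flow's flits can
interleave between them, which is what makes contention observable.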
### io_cpu
|
||||
|
||||
<!-- src: ADR-0036 Context, Decision -->
|
||||
The IO_CPU is the control-plane processor sitting inside the IO chiplet.
|
||||
It receives kernel-launch messages from the host, decodes them, and
|
||||
dispatches per-cube launches to the cube's management CPU. Pure memory
|
||||
operations bypass it entirely, taking the direct data path established
|
||||
inside the IO chiplet.
|
||||
|
||||
<!-- src: ADR-0036 Decision -->
|
||||
On receiving a kernel-launch message, the IO_CPU consults the message's
shard list, which already names the target SIP, cube, and PE for each
piece of the tensor argument, and forwards a per-cube launch to each
cube the kernel needs to reach. This makes the IO_CPU a deterministic
fan-out point: it does not decode physical addresses to route, it just
follows the explicit per-shard targets it was handed.
|
||||
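The fan-out described above amounts to grouping the shard list by its
explicit targets. The sketch below assumes a simple `Shard` record and a
`fan_out` helper; both are hypothetical shapes for illustration and do not
reproduce the real message schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    sip: int
    cube: int
    pe: int
    pa: int
    nbytes: int


def fan_out(shards):
    """Group shards into one launch per (sip, cube), in sorted target order."""
    per_cube = defaultdict(list)
    for shard in shards:
        # Route on the explicit tags only; the physical address is not decoded.
        per_cube[(shard.sip, shard.cube)].append(shard)
    return [(target, per_cube[target]) for target in sorted(per_cube)]


shards = [
    Shard(sip=0, cube=1, pe=0, pa=0x1000, nbytes=256),
    Shard(sip=0, cube=0, pe=2, pa=0x2000, nbytes=256),
    Shard(sip=0, cube=1, pe=3, pa=0x3000, nbytes=256),
]
for target, group in fan_out(shards):
    print(target, [s.pe for s in group])
# (0, 0) [2]
# (0, 1) [0, 3]
```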
### m_cpu
|
||||
|
||||
<!-- src: ADR-0035 Context, Decision -->
|
||||
The M_CPU is the cube's management processor. It owns two distinct
|
||||
roles: as a control-plane fan-out point for kernel launches arriving
|
||||
from the IO chiplet, and as a DMA endpoint for host-initiated memory
|
||||
writes that need to land in this cube's HBM. The control role
|
||||
forwards launches to the right PE control CPUs; the DMA role places
|
||||
the actual bytes into HBM through the router mesh.
|
||||
|
||||
<!-- src: ADR-0035 Decision -->
|
||||
The component model deliberately distinguishes the two roles because
|
||||
their routing differs: the control fan-out path uses command-kind
|
||||
links that do not appear on data-path routes, while the DMA path uses
|
||||
the same router mesh as PE-initiated DMA, with PE-internal nodes
|
||||
excluded. The routing layer knows about both modes and selects the
|
||||
appropriate adjacency at request time.
|
||||
|
||||
### pcie_ep
|
||||
|
||||
<!-- src: ADR-0038 Context, Decision -->
|
||||
The PCIe endpoint is the protocol boundary at the host-device edge.
On each incoming request it first applies a configured
protocol-processing overhead; after that it simply forwards. There is
no internal queuing model, no retry, and no TLP-level fidelity; those
are deliberately out of scope. The endpoint is bidirectional: host →
device traffic (memory writes, kernel launches) flows one way, and
device-side outbound traffic (cross-SIP collective sends) flows the
other.
|
||||
<!-- src: ADR-0038 Decision, Alternatives Considered, Consequences -->
|
||||
A more detailed PCIe model was considered and rejected. The simulator
targets system-level latency comparisons; making the endpoint
heavier with credit-management and retry logic would not improve the
metrics being studied. The decision keeps the endpoint as the
documented protocol-boundary node, named consistently so routing
helpers can locate it by SIP and IO instance.
|
||||
### pe_cpu
|
||||
|
||||
<!-- src: ADR-0014 Decision -->
|
||||
The PE control CPU is the entry point for kernel work arriving from
|
||||
the cube's management CPU. It receives kernel-launch messages, resolves
|
||||
the kernel function by name, and hands execution to the scheduler with
|
||||
the resolved tensor arguments. From the scheduler's point of view, the
|
||||
PE_CPU is the upstream source of high-level commands; from the rest
|
||||
of the system's point of view, the PE_CPU is where a kernel's
|
||||
execution begins on a given PE.
|
||||
|
||||
### pe_dma
|
||||
|
||||
<!-- src: ADR-0014 Decision, ADR-0023 Decision -->
|
||||
The DMA engine on each PE has two distinct modes. In the standard PE
|
||||
pipeline it consumes tile tokens issued by the scheduler, acquires a
|
||||
read or write channel (modeled as a one-in-flight resource per
|
||||
direction), and runs the bytes to or from HBM through the mesh. In
|
||||
its collective mode it forwards send tokens for the cube's IPCQ into
|
||||
the fabric, snapshotting the source data at send time so later
|
||||
mutations cannot race the receiver's read. Both modes share the same
|
||||
channel resources but differ in their downstream handling — one
|
||||
returns when the round-trip completes, the other dispatches
|
||||
fire-and-forget.
|
||||
|
||||
### pe_fetch_store
|
||||
|
||||
<!-- src: ADR-0014 Decision -->
|
||||
The fetch-store engine is the bridge between the on-PE scratchpad
|
||||
(TCM) and the register file. It does not run DMA; it only moves bytes
|
||||
internally. On receiving a tile-stage token it sends a short request
|
||||
to the TCM, waits for the bandwidth-serialized delay, and continues
|
||||
the pipeline. The split between this engine and the TCM lets the
|
||||
scratchpad model its own read/write bandwidth independently.
|
||||
|
||||
### pe_gemm
|
||||
|
||||
<!-- src: ADR-0014 Decision -->
|
||||
The GEMM engine is the matrix-multiply compute unit. Tile tokens
arriving at this stage carry the per-tile dimensions, and the engine
contributes a service time derived from the tile's multiply-accumulate
(MAC) count, one fused multiply-add per MAC. Composite operations (where
the same tensor pair is streamed across many tiles) reuse the engine
through the scheduler; the engine itself is stateless between tiles.
|
||||
### pe_ipcq
|
||||
|
||||
<!-- src: ADR-0023 Context, Decision -->
|
||||
The IPCQ — inter-process communication queue — is each PE's
|
||||
collective-communication endpoint. It owns ring buffers that hold
|
||||
inbound messages from neighbor PEs and bookkeeping for send credits.
|
||||
Direction names ("N", "S", "E", "W" for cube-internal neighbors and
|
||||
"global_*" for cross-SIP neighbors) are resolved to physical peer
|
||||
endpoints by a neighbor table installed at process-group creation
|
||||
time. The component itself does not move bytes — it issues DMA tokens
|
||||
through the local PE_DMA, which performs the actual cross-PE
|
||||
transfer.
|
||||
|
||||
<!-- src: ADR-0023 Decision, Consequences -->
|
||||
A key invariant is that the inbound terminal — where data lands at
|
||||
the receiver — pays the link bandwidth drain plus any cube-internal
|
||||
mesh hop to the slot's backing memory. This prevents IPCQ from
|
||||
silently outpacing raw DMA at large transfer sizes. Outbound sends
|
||||
are fire-and-forget; credit return is the only backpressure signal.
|
||||
|
||||
### pe_math
|
||||
|
||||
<!-- src: ADR-0014 Decision -->
|
||||
The math engine handles element-wise and reduction operations. It
|
||||
consumes tile tokens carrying an operation kind (`exp`, `sum`, `max`,
|
||||
`where`, etc.) and contributes a service time proportional to the
|
||||
number of elements processed. Like the GEMM engine it is stateless;
|
||||
chained epilogues (a sequence of math operations after a GEMM tile)
|
||||
are scheduled as separate stages.
|
||||
|
||||
### pe_mmu
|
||||
|
||||
<!-- src: ADR-0039 Context, Decision -->
|
||||
The MMU has two roles, exposed through one component. As a node on
|
||||
the cube NOC it receives MMU-map and MMU-unmap messages and updates
|
||||
its internal page table, so that the runtime API can install
|
||||
virtual-to-physical mappings with measured fabric latency. As a
|
||||
utility object held inside the PE it offers synchronous translate
|
||||
calls to the PE's DMA and GEMM engines without taking simulator time
|
||||
itself; the calling engine pays any configured TLB overhead in its
|
||||
own process.
|
||||
|
||||
<!-- src: ADR-0039 Decision, Alternatives Considered -->
|
||||
The page table supports multiple disjoint regions inside a single
page, with later-write-wins semantics on overlap. This is a deliberate
simulator stopgap: it lets parallelization policies shard data at
sub-page granularity, which a real hardware MMU's one-PA-per-entry
assumption would silently mis-route. A real MMU does not work this
way; the model documents this as a simplification.
|
||||
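A region list with later-write-wins lookup can be sketched in a few lines.
The class and method names below are hypothetical, and returning the input
address unchanged on a miss is an assumption modeled on the PA fallback
this document describes elsewhere; the real page table may differ.

```python
class RegionPageTable:
    """Region-granular VA-to-PA table; the most recent mapping wins on overlap."""

    def __init__(self):
        self._regions = []  # (va_start, va_end, pa_start) in install order

    def map(self, va, size, pa):
        self._regions.append((va, va + size, pa))

    def translate(self, va):
        # Scan newest first so a later mapping shadows an earlier one.
        for start, end, pa in reversed(self._regions):
            if start <= va < end:
                return pa + (va - start)
        return va  # assumed fallback: treat an unmapped address as a PA


table = RegionPageTable()
table.map(0x1000, 0x800, 0xA000)  # first half of a page
table.map(0x1800, 0x800, 0xB000)  # second half, different backing PA
table.map(0x1400, 0x100, 0xC000)  # overlaps the first region

print(hex(table.translate(0x1000)))  # 0xa000
print(hex(table.translate(0x1900)))  # 0xb100
print(hex(table.translate(0x1450)))  # 0xc050 (later write wins)
print(hex(table.translate(0x9000)))  # 0x9000 (no mapping)
```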
### pe_scheduler
|
||||
|
||||
<!-- src: ADR-0014 Decision -->
|
||||
The scheduler is the sole dispatcher inside a PE. Simple commands are
|
||||
routed directly to the right engine. Composite commands generate a
|
||||
tile plan, and the resulting tile tokens are fed into the pipeline.
|
||||
Self-routing keeps the scheduler off the per-stage hot path: each
|
||||
engine, on finishing a stage, advances the token to the next stage's
|
||||
component itself, so the scheduler only does initial dispatch and
|
||||
completion tracking.
|
||||
|
||||
### pe_tcm
|
||||
|
||||
<!-- src: ADR-0040 Context, Decision -->
|
||||
The TCM is the per-PE tightly-coupled scratchpad memory. It models
time only, not data; the actual payload lives in the simulator's
memory store. Read and write are independent channels: each is
modeled as a one-in-flight resource, so same-direction requests
serialize but a read and a write can overlap. The bandwidth of each
direction is configured separately and applied as bytes-over-bandwidth
on each request.
|
||||
<!-- src: ADR-0040 Decision, Alternatives Considered -->
|
||||
The decision to keep read and write on separate channels was made
|
||||
because the PE pipeline's normal case overlaps fetch (read) and store
|
||||
(write). Collapsing them into a single shared channel would have
|
||||
artificially serialized that overlap and produced an incorrect
|
||||
bandwidth ceiling.
|
||||
|
||||
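The effect of separate read and write channels can be shown with a small
timing model that tracks when each direction becomes free. This is a sketch
only: `TcmModel`, its method, and the bandwidth numbers are assumptions for
illustration, and it models time without storing any data.

```python
class TcmModel:
    """One-in-flight resource per direction with byte-over-bandwidth delay."""

    def __init__(self, read_bytes_per_ns, write_bytes_per_ns):
        self._bw = {"read": read_bytes_per_ns, "write": write_bytes_per_ns}
        self._free_at = {"read": 0.0, "write": 0.0}

    def access(self, kind, nbytes, now_ns):
        """Return the completion time of a request issued at now_ns."""
        start = max(now_ns, self._free_at[kind])  # wait for the same direction
        done = start + nbytes / self._bw[kind]
        self._free_at[kind] = done
        return done


tcm = TcmModel(read_bytes_per_ns=32.0, write_bytes_per_ns=16.0)
print(tcm.access("read", 64, 0.0))   # 2.0
print(tcm.access("read", 64, 0.0))   # 4.0, serialized behind the first read
print(tcm.access("write", 64, 0.0))  # 4.0, overlaps with the reads
```

With a single shared channel the write would finish at 8.0 instead of 4.0,
which is the artificial serialization the separate-channel decision avoids.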
### sram
|
||||
|
||||
<!-- src: ADR-0041 Context, Decision -->
|
||||
The cube SRAM is a per-cube scratchpad attached to one of the cube's
|
||||
routers. As a node it applies a configured access overhead, pays the
|
||||
link-bandwidth drain stamped on the incoming request, and sends a
|
||||
response on the reverse path. It is a terminal — it does not forward.
|
||||
|
||||
<!-- src: ADR-0041 Decision, Consequences -->
|
||||
A second role is as one of three backing-memory tiers (TCM, SRAM, HBM)
|
||||
that an inter-PE collective slot can live in. When the slot lives in
|
||||
SRAM, the PE_DMA pays the slot read or write latency directly using
|
||||
the configured SRAM bandwidth and overhead; the SRAM component does
|
||||
not need to know about collective semantics. This separation keeps
|
||||
the SRAM component agnostic to the collective subsystem.
|
||||
|
||||
### tiling
|
||||
|
||||
<!-- src: ADR-0042 Context, Decision -->
|
||||
The tile-plan generator is not a runtime component — it is a pure
|
||||
module of functions that take a problem shape (matrix dimensions, tile
|
||||
sizes) and produce an ordered list of tile-stage sequences. The
|
||||
scheduler consumes this list. Each tile's stage sequence depends on
|
||||
how its operands are staged: operands streamed from HBM produce
|
||||
DMA_READ stages, operands already resident in TCM (because they were
|
||||
loaded eagerly upfront) skip them.
|
||||
|
||||
<!-- src: ADR-0042 Decision, Consequences -->
|
||||
The plan generator is intentionally pure: given the same input it
returns the same plan, with no simulator events created. This lets
the rest of the system reason about tile sequences as data, and it
makes the plan testable in isolation without simulator state. New
plan variants (for example, K-major or DTensor-aware plans) can be
added as new functions following the same shape.
|
||||
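A pure plan builder of the kind described above might look like the
following. The function name, the argument list, and the single
`operands_resident` switch are hypothetical simplifications; the stage
names are taken from the PE diagram note earlier in this document.

```python
def gemm_tile_plan(m, n, tile_m, tile_n, operands_resident=False):
    """Return [(tile_origin, stages)] for an m x n output, row-major over tiles."""
    plan = []
    for row in range(0, m, tile_m):
        for col in range(0, n, tile_n):
            # Operands already in TCM skip the DMA_READ stage.
            stages = [] if operands_resident else ["DMA_READ"]
            stages += ["FETCH", "GEMM", "STORE", "DMA_WRITE"]
            plan.append(((row, col), tuple(stages)))
    return plan


streamed = gemm_tile_plan(64, 64, 32, 32)
resident = gemm_tile_plan(64, 64, 32, 32, operands_resident=True)

print(len(streamed))   # 4
print(streamed[0])     # ((0, 0), ('DMA_READ', 'FETCH', 'GEMM', 'STORE', 'DMA_WRITE'))
print(resident[0][1])  # ('FETCH', 'GEMM', 'STORE', 'DMA_WRITE')

# Purity: the same input yields an equal plan, with no side effects.
assert gemm_tile_plan(64, 64, 32, 32) == streamed
```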
---
|
||||
|
||||
## Implementation Decisions
|
||||
|
||||
This section collects cross-cutting decisions — algorithms, policies,
|
||||
schemes, and contracts — that span multiple components rather than
|
||||
living inside one.
|
||||
|
||||
### Address Scheme
|
||||
|
||||
<!-- src: ADR-0001 Context, Decision -->
|
||||
Every physical address in the simulator decodes into a structured
location. A fixed-width physical address carries the SIP id, the
cube id within the SIP, a type discriminator (HBM vs PE-resource vs
others), and a type-specific offset. HBM addresses additionally encode
the per-PE slice offset so the controller can determine which PE
owns the target slice without external lookup. The layout deliberately
reserves space rather than being packed to fit, so new sub-units can
be added at the type-discriminator level without rewriting existing
addresses.
|
||||
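Decoding such an address is a sequence of shifts and masks. The field
widths below are hypothetical: they were chosen only so that the four
fields add up to the 51 bits named in ADR-0001's title, and the actual
layout is defined by that ADR. `Location`, `encode`, and `decode` are
likewise illustrative names.

```python
from typing import NamedTuple

# Hypothetical widths (SIP, cube, type, offset); they sum to 51 bits.
SIP_BITS, CUBE_BITS, TYPE_BITS, OFFSET_BITS = 4, 5, 2, 40
WIDTHS = (SIP_BITS, CUBE_BITS, TYPE_BITS, OFFSET_BITS)


class Location(NamedTuple):
    sip: int
    cube: int
    kind: int
    offset: int


def encode(loc: Location) -> int:
    """Pack the fields most-significant first: sip | cube | kind | offset."""
    pa = 0
    for value, bits in zip(loc, WIDTHS):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"field value {value} does not fit in {bits} bits")
        pa = (pa << bits) | value
    return pa


def decode(pa: int) -> Location:
    """Unpack the fields by masking from the least-significant end."""
    fields = []
    for bits in reversed(WIDTHS):
        fields.append(pa & ((1 << bits) - 1))
        pa >>= bits
    return Location(*reversed(fields))


loc = Location(sip=1, cube=5, kind=0, offset=0x1234)
pa = encode(loc)
print(pa < (1 << 51))     # True
print(decode(pa) == loc)  # True
```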
<!-- src: ADR-0011 Context, Decision -->
|
||||
On top of physical addressing, the simulator supports three address
models that the runtime API selects between. Direct physical
addressing is retained as a fallback. Virtual addressing, the
current default, gives each tensor a contiguous virtual range at
deployment, with the per-PE MMU translating per access. A
logical-address scheme remains a future option. Every current test
path takes the virtual-address path; the PA fallback is used by the
MMU itself when no mapping exists for an address (a deliberate
signal, not an error).
|
||||
<!-- src: ADR-0011 Decision, Consequences -->
|
||||
Tensor placement is represented as a list of physical-address shards,
|
||||
each tagged with target SIP, cube, and PE, plus a single tensor-wide
|
||||
virtual base. This means a kernel sees one virtual base for the whole
|
||||
tensor while the host driver and the engine still know exactly where
|
||||
each shard lives. Replicated tensors get per-cube local PA mappings;
|
||||
sharded tensors broadcast their mapping across cubes within a SIP.
|
||||
|
||||
### Routing, Distance & Helper API

<!-- src: ADR-0002 Context, Decision -->
Routing is policy-driven, deterministic, and topology-aware. Given a
source, a destination, and an intent (for example, PE-initiated
DMA versus host-initiated memory write versus a generic
component-to-component query), the routing layer picks the right
path. The intent matters because different traffic types must avoid
different categories of edges: PE-initiated DMA should not traverse
command-only links; M_CPU DMA should not pass through PE-internal
pipeline edges; cube-local transfers should not use the
zero-distance UCIe bus that would otherwise look attractive to a
shortest-path search.

<!-- src: ADR-0051 Decision -->
The routing layer therefore maintains four separate adjacency graphs
at construction, each excluding a different category of edges, and
picks the appropriate one per intent. On top of the graphs sits a
helper API that hides the topology's naming convention: callers ask
for the PCIe endpoint of a given SIP, the M_CPU of a given cube, or
the HBM destination for a given physical address, and receive the
corresponding node id. No component constructs node-id strings
directly; if the naming convention ever changes, the change is local
to the helper layer.

<!-- src: ADR-0051 Decision, Consequences -->
Path-finding itself uses Dijkstra with explicit per-edge weights
(routing weight is allowed to differ from physical distance; for
example, UCIe is configured to be routing-preferable). Tie-breaks
follow insertion order, which keeps results deterministic. Paths
between unreachable nodes raise rather than returning empty, surfacing
topology errors immediately.

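The path-finding behavior described above can be sketched with the standard-library heap. This is an illustration under assumptions, not the router's code: the graph representation, the function name `shortest_path`, and the reading of "insertion order" as the order in which nodes were added to the graph are all hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

Graph = Dict[str, List[Tuple[str, float]]]  # node -> [(neighbor, weight)]


def shortest_path(graph: Graph, src: str, dst: str) -> List[str]:
    """Dijkstra with a deterministic tie-break; raises if dst is unreachable."""
    # Node insertion order doubles as the tie-break key in the heap.
    order = {node: i for i, node in enumerate(graph)}
    best = {src: 0.0}
    prev: Dict[str, str] = {}
    done = set()
    heap = [(0.0, order.setdefault(src, len(order)), src)]
    while heap:
        dist, _, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == dst:
            break
        for nxt, weight in graph.get(node, []):
            cand = dist + weight
            # Strict comparison: the first path found at a given cost wins.
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                prev[nxt] = node
                heapq.heappush(
                    heap, (cand, order.setdefault(nxt, len(order)), nxt))
    if dst not in done:
        raise ValueError(f"no route from {src!r} to {dst!r}")
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]
```

Keeping one such graph per intent, each built without the edges that intent must avoid, means the search itself never needs to know about traffic categories.
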
### Memory Semantics and Local-HBM Bandwidth

<!-- src: ADR-0004 Context, Decision -->
A PE accessing its own HBM slice through its own cube's NOC must see
the full local HBM bandwidth; that is the model's intent. Memory
traffic accumulates latency from per-component overhead and
bytes-over-link-bandwidth serialization along the path, but the
controller does not throttle below the slice's allotted bandwidth.
Cross-PE-slice accesses inside the same cube, cross-cube accesses
through UCIe, and cross-SIP accesses through PCIe each pay
progressively more overhead as the path grows.

### Topology Compilation, Diagrams & Builder Algorithms

<!-- src: ADR-0006 Context, Decision -->
Topology is configurable, not hardcoded. The simulator reads a YAML
spec, compiles it into a flat graph of nodes and edges plus four
view projections at different abstraction levels (system, SIP, cube,
PE), and uses the compiled graph as the single source for both
execution and visualization. Distance metadata used by routing is
extracted at compile time so that diagrams and routing decisions
agree by construction.

<!-- src: ADR-0005 Context, Decision -->
Diagrams are derived artifacts of the compiled topology. The visualizer
produces one SVG per view at the appropriate abstraction level; nothing
in the diagrams is hand-drawn or hand-positioned. Distance-aware
layout rules place nodes in the diagrams using the same coordinates
that routing uses to compute distance, so a diagram that "looks
wrong" is a signal that the topology itself has a problem, not the
visualizer.

<!-- src: ADR-0053 Decision -->
Inside a cube the router mesh is generated automatically. PE corner
positions are fixed by convention; the relay-column algorithm
inserts additional grid columns whenever the gap between adjacent PE
columns would exceed a tunable maximum. HBM occupies a central
exclusion zone; router slots inside the zone are deliberately empty,
since HBM controllers attach as separate named nodes. M_CPU and SRAM
attach to the nearest router by Euclidean distance from their
configured placement coordinates, and UCIe physical lanes distribute
along the boundary rows and columns. The whole mesh is cached
beside the topology spec and invalidated only when one of a small set
of layout-relevant fields changes.

<!-- DIAGRAM: One cube's router mesh — rows × cols of routers with HBM exclusion zone in the middle, PEs/M_CPU/SRAM attaching to nearest routers, UCIe PHYs along the perimeter. -->

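One plausible reading of the relay-column rule, sketched in one dimension. The even spacing of relays, the function name, and the parameter names are assumptions; the text only states that columns are inserted when the gap between adjacent PE columns would exceed a tunable maximum.

```python
import math
from typing import List, Sequence


def insert_relay_columns(pe_columns: Sequence[float],
                         max_gap: float) -> List[float]:
    """Return column positions with relays added so no gap exceeds max_gap."""
    if max_gap <= 0:
        raise ValueError("max_gap must be positive")
    cols = sorted(pe_columns)
    out = [cols[0]]
    for left, right in zip(cols, cols[1:]):
        gap = right - left
        # Smallest relay count that brings every sub-gap down to max_gap.
        n_relays = max(0, math.ceil(gap / max_gap) - 1)
        for i in range(1, n_relays + 1):
            out.append(left + gap * i / (n_relays + 1))
        out.append(right)
    return out
```

For example, PE columns at 0 and 10 with a maximum gap of 5 gain a single relay column at 5.0, while columns that are already close enough are returned unchanged.
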
### Tensor Deployment and Allocation

<!-- src: ADR-0008 Context, Decision -->
Tensor deployment in the runtime API produces a list of physical-address
shards plus a single tensor-wide virtual base. The host allocator
walks the data-parallelism policy, computes per-shard placement, and
emits the per-shard physical addresses through the per-PE allocators.
No separate "allocate then later attach to a device" RPC exists;
allocation and deployment are a single operation that produces a
deployed tensor handle.

### Memory Allocator Algorithms

<!-- src: ADR-0048 Context, Decision -->
Each per-PE allocator owns two channels, HBM slice and TCM, each
backed by an offset-keyed free-list. Allocation is first-fit; freeing
coalesces with adjacent free blocks. A device-wide virtual allocator
sits above the per-PE allocators, aligns requests up to the configured
page size, and coalesces on free in the same way. The trade-off is
explicit: first-fit is simpler and cheaper than best-fit or buddy
allocation, and the simulator's workload is stack-like enough
(deploy / kernel / free in matched order) that fragmentation is not
a practical concern.

<!-- src: ADR-0048 Decision, Consequences -->
Allocation failure raises rather than silently returning a partial
result. A partial tensor reaching the engine would route over wrong
PAs and silently corrupt simulator output, so an out-of-memory signal
is preferred. The free path trusts its caller to pass back exactly
what was allocated; the small risk of caller error in exchange for
fast common-case freeing is documented as a deliberate trade.

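A compact sketch of a first-fit free-list with coalescing on free, for one channel. The class and method names are hypothetical, and page alignment (the virtual allocator's job in the text) is omitted.

```python
import bisect
from typing import List, Tuple


class OutOfMemory(RuntimeError):
    """Raised instead of returning a partial allocation."""


class FirstFitAllocator:
    def __init__(self, size: int) -> None:
        # Free blocks as (offset, length), kept sorted by offset.
        self.free: List[Tuple[int, int]] = [(0, size)]

    def alloc(self, nbytes: int) -> int:
        for i, (off, length) in enumerate(self.free):
            if length >= nbytes:          # first block that fits wins
                if length == nbytes:
                    del self.free[i]
                else:
                    self.free[i] = (off + nbytes, length - nbytes)
                return off
        raise OutOfMemory(f"cannot allocate {nbytes} bytes")

    def release(self, off: int, nbytes: int) -> None:
        # Trusts the caller to pass back exactly what alloc() returned.
        i = bisect.bisect_left(self.free, (off, 0))
        self.free.insert(i, (off, nbytes))
        # Coalesce with the following block, then with the preceding one.
        if i + 1 < len(self.free) and off + nbytes == self.free[i + 1][0]:
            self.free[i] = (off, nbytes + self.free[i + 1][1])
            del self.free[i + 1]
        if i > 0 and sum(self.free[i - 1]) == self.free[i][0]:
            prev_off, prev_len = self.free[i - 1]
            self.free[i - 1] = (prev_off, prev_len + self.free[i][1])
            del self.free[i]
```

With a deploy / kernel / free workload in matched order, the free-list tends to collapse back to a single block, which is why the simpler policy is adequate here.
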
### Kernel Execution and Host-Device Messaging

<!-- src: ADR-0009 Context, Decision -->
Kernel execution decomposes into a small set of messages that travel
the device graph. The host issues a single kernel-launch message; the
IO_CPU fans it out per-cube; the cube M_CPU fans it out per-PE; the
PE CPU resolves the kernel and runs it through the scheduler.
Completion flows back the same way, gated by per-shard completion
tracking. Memory operations follow the same pattern: a memory write
or read travels as one message that the engine routes to the right
HBM controller, with a response taking the reverse path.

<!-- src: ADR-0012 Context, Decision -->
The schema between the host and the device-side IO CPU is PA-first
and shard-tagged. Every byte of host-issued payload arrives with an
explicit target SIP, cube, PE, and physical address. The IO_CPU does
not decode addresses to derive placement; placement is named
explicitly by the shard list. This makes the host-device interface
deterministic and keeps the routing helper free of host-derived
intent.

### CLI Surface and Semantics

<!-- src: ADR-0010 Context, Decision -->
The command-line interface exposes four subcommands. A bench runner
loads a topology, resolves a registered benchmark by name or index,
and runs it on a selected device. A bench-listing command enumerates
the registered benchmarks. A probe utility runs a fixed catalog of
traffic patterns through the engine for latency and bandwidth
verification. A web viewer renders the topology in a browser. A
benchmark instance is always single-device by convention; multi-SIP
collective work happens inside the benchmark through the launcher
abstraction, not by multiplexing the CLI.

### Component Port and Wire Fabric Model

<!-- src: ADR-0015 Context, Decision -->
Every modeled component exposes input and output ports, and every
edge in the topology connects an output port on one component to an
input port on another. Bandwidth and propagation delay are properties
of the wire between ports, not of the component endpoints. A
component's responsibility is to apply its configured per-node
overhead and either forward to the next hop or terminate; the wire
charges the byte-over-bandwidth serialization separately.

<!-- src: ADR-0015 Decision, Consequences -->
This separation lets components be swapped behind their port
interface without changing the rest of the model, and it keeps
bandwidth contention at the wire level where multiple components may
contend for the same edge. Future component models can refine
internal behavior without disturbing the fabric.

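The split between node overhead and wire cost can be made concrete with a small latency calculation. This is a sketch under simplifying assumptions: store-and-forward at every hop, no contention, and hypothetical names (`Wire`, `path_latency_ns`) and units.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Wire:
    bandwidth_bytes_per_ns: float  # property of the wire, not the endpoints
    propagation_ns: float


def path_latency_ns(nbytes: int,
                    node_overheads_ns: Sequence[float],
                    wires: Sequence[Wire]) -> float:
    """Uncontended latency of one transfer along a path of nodes and wires."""
    if len(wires) != len(node_overheads_ns) - 1:
        raise ValueError("a path of N nodes has N - 1 wires")
    # Each component charges only its configured per-node overhead.
    node_cost = sum(node_overheads_ns)
    # Each wire charges propagation plus bytes-over-bandwidth serialization.
    wire_cost = sum(w.propagation_ns + nbytes / w.bandwidth_bytes_per_ns
                    for w in wires)
    return node_cost + wire_cost
```

For a 1024-byte transfer over three nodes with overheads 5, 2 and 5 ns and wires of 64 and 32 bytes/ns with 1 and 3 ns propagation, the sum is 12 + 17 + 35 = 64 ns.
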
### Two-Pass Data Execution

<!-- src: ADR-0020 Context, Decision -->
The simulator runs in two passes. The first pass, fast and always
on, runs the discrete-event engine and records every data operation
in an operation log with timestamps, component identifiers, and
per-operation parameters. The second pass, optional and opt-in, replays
the log against an in-memory tensor store to produce actual numerical
results. Tests that only need timing skip the second pass; tests that
need to verify correctness opt in.

<!-- src: ADR-0020 Decision, Consequences -->
The split lets the timing engine remain unconcerned with data
semantics: kernels move handles around, not bytes. The replay phase
recovers data semantics from the recorded operations, in their
original time order with a small set of secondary-sort rules. The
op-log records carry enough metadata (input snapshots for compute
operations, source snapshots for cross-component copies) that the
replay phase cannot mis-order with respect to in-flight mutations.

### Sim-engine Op Log and Memory Store Schemas

<!-- src: ADR-0052 Context, Decision -->
The operation log holds typed records with seven fields each: start
and end timestamps, the component that issued the operation, an
operation kind ("memory", "gemm", "math"), an operation name, a
parameter dictionary, and a (currently unused) dependency list.
Records are kept in stable timestamp order. The parameter dictionary
varies by operation: a DMA read carries source address and byte count;
a GEMM carries operand shapes, dtypes, and address spaces; a math
operation carries input addresses and snapshots.

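A record with these seven fields could be sketched as a dataclass. The names `t_start`, `t_end`, `component_id`, and `params` match how the sweep script in this change reads the log; `op_kind`, `op_name`, `deps`, and the helper `in_time_order` are assumed names for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component_id: str
    op_kind: str                       # "memory", "gemm", or "math"
    op_name: str
    params: Dict[str, Any] = field(default_factory=dict)
    deps: List[int] = field(default_factory=list)  # currently unused


def in_time_order(records: Iterable[OpRecord]) -> List[OpRecord]:
    """Sort by start time; sorted() is stable, so ties keep log order."""
    return sorted(records, key=lambda r: r.t_start)
```

Using a stable sort means two operations that start at the same timestamp stay in the order they were recorded, which keeps a later replay deterministic.
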
<!-- src: ADR-0052 Decision, Consequences -->
The companion memory store is a two-level dictionary keyed by
address space ("hbm", "tcm", "sram", others) and integer address.
Reads and writes are reference-based (no copy by default), so
callers wanting to detach a snapshot must copy explicitly. This is
deliberate: the engine-internal snapshot paths copy at well-defined
points (math input capture, HBM source capture for DMA writes,
inbound collective copies) and downstream replay code therefore
sees stable data even when slot or scratch addresses are reused by
later operations.

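The reference semantics are easy to get wrong, so here is a small sketch of a two-level store and the difference between an alias and a detached snapshot. `MemoryStore` and its method names are hypothetical, and plain lists stand in for tensor data.

```python
import copy
from typing import Any, Dict


class MemoryStore:
    """Two-level dict: address space -> integer address -> value."""

    def __init__(self) -> None:
        self._spaces: Dict[str, Dict[int, Any]] = {}

    def write(self, space: str, addr: int, value: Any) -> None:
        # Stores the reference; the caller's object is not copied.
        self._spaces.setdefault(space, {})[addr] = value

    def read(self, space: str, addr: int) -> Any:
        # Returns the stored reference, again without copying.
        return self._spaces[space][addr]


store = MemoryStore()
buf = [1, 2, 3]
store.write("tcm", 0x100, buf)

alias = store.read("tcm", 0x100)                     # same object as buf
snapshot = copy.deepcopy(store.read("tcm", 0x100))   # explicit detach

buf[0] = 99   # a later operation reuses the slot
```

After the mutation, `alias` observes the new value while `snapshot` still holds the original data, which is the behavior the snapshot points in the engine rely on.
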
### 2D Grid Program Identity

<!-- src: ADR-0022 Context, Decision -->
Inside a kernel the program identity is two-dimensional. The
first axis corresponds to the PE index within a cube; the second
corresponds to the cube index within a SIP. Together they let a
kernel address its position both within its cube and within the
larger system without needing to know the full topology. Total
program counts along each axis are exposed symmetrically.

### Parallelism — SIP Launcher, DPPolicy, Megatron-TP, AHBM Backend, and CCL Algorithm Module

<!-- src: ADR-0024 Context, Decision -->
The launcher model treats each SIP as one rank. Inside a process the
launcher spawns one greenlet per SIP rank; the rank is bound to its
greenlet so that any code running in that worker sees the right
distributed-style rank. This is a deliberately PyTorch-compatible
shape: a benchmark looks like a small DDP training script: initialize
a process group, spawn workers, each worker runs the same body.

<!-- src: ADR-0026 Context, Decision -->
Data-parallelism policy lives in a single object that names the
sharding strategy along the cube axis (replicate, row-wise,
column-wise) and along the PE axis (same set of values), and optionally
overrides the number of cubes or PEs participating. The policy is
intra-device: it does not cross SIP boundaries. SIP-level parallelism
is the launcher's responsibility, and the two axes compose
orthogonally.

<!-- src: ADR-0027 Context, Decision -->
A Megatron-style tensor-parallel API sits on top of the launcher and
the DP policy. Layer-level building blocks (column-parallel linear,
row-parallel linear, all-reduce) name their sharding intent in terms
the launcher and the placement policy can compose. This is the layer
that bench code typically writes against.

<!-- src: ADR-0047 Context, Decision -->
For collective operations the runtime exposes a PyTorch-compatible
distributed backend named "ahbm". On process-group initialization the
backend loads the configured collective-algorithm module, resolves
the world size (priority: explicit ccl.yaml override → defaults
section → topology SIP count), imports the algorithm module
dynamically, derives the SIP topology kind, and pushes the inter-PE
neighbor table to every participating PE. From that point on, an
all-reduce call dispatches the algorithm's kernel function across
all ranks.

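The world-size priority chain can be sketched as a small resolver. The configuration key names (`world_size`, `defaults`) and the function name are assumptions; only the three-step priority order comes from the text.

```python
from typing import Any, Mapping


def resolve_world_size(ccl_cfg: Mapping[str, Any],
                       topology_sip_count: int) -> int:
    """Explicit override, then defaults section, then topology SIP count."""
    override = ccl_cfg.get("world_size")            # hypothetical key
    if override is not None:
        return int(override)
    default = ccl_cfg.get("defaults", {}).get("world_size")
    if default is not None:
        return int(default)
    return topology_sip_count
```

Testing against `None` rather than truthiness keeps the chain explicit about "not configured" versus a configured value.
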
<!-- src: ADR-0050 Context, Decision -->
A collective-algorithm module is a Python module with a small, fixed
contract. It exposes topology-kind integer constants, a name-to-kind
mapping for the YAML configuration, a kernel-arguments builder, and
a kernel function, the kernel function being aliased to the name
`kernel` so the backend can find it generically. The kernel itself
takes the tensor pointer, the per-cube element count, cube mesh
width and height, the world size, the current rank, and the SIP
topology dimensions; the backend appends those last four arguments
automatically. New collectives slot in by adding a new module that
follows this shape.

<!-- src: ADR-0027 Decision, Consequences -->
The combination is deliberate: bench authors get to write code that
looks like a regular distributed training script, while the launcher,
backend, and placement policies behind it remain free to redirect
work to the right SIP, cube, and PE without exposing topology to the
kernel.

### IPCQ Direction Addressing

<!-- src: ADR-0025 Context, Decision -->
Inside a collective algorithm, peer PEs are named by direction:
"N", "S", "E", "W" for cube-internal neighbors, and "global_*" for
cross-SIP neighbors. Direction addressing is the addressing scheme:
the algorithm names a direction, the IPCQ neighbor table installed
at process-group time resolves the direction to the peer endpoint's
physical-address coordinates, and the PE_DMA performs the actual
transfer. The algorithm itself does not see PA arithmetic; direction
is the user-facing handle.

### Intercube All-Reduce

<!-- src: ADR-0032 Context, Decision -->
The default all-reduce algorithm uses a center-rooted bidirectional
phase inside each SIP's cube mesh followed by an inter-SIP exchange
on the mesh's root cube, and then a bidirectional broadcast back
out. Center-rooting halves the in-cube hop count compared with a
corner-rooted walk. The inter-SIP exchange itself follows the
configured SIP topology (ring, torus, or non-wrapping mesh),
selected at runtime through the SIP-topology kind integer the
backend passes to the kernel.

### Evaluation Harnesses

<!-- src: ADR-0043 Context, Decision -->
The all-reduce evaluation harness drives correctness and the
latency/buffer-kind sweeps through the public distributed path
(initialize process group, spawn workers, call all-reduce) rather
than the lower-level engine interface. A shared helper module factors
out the setup; sweep tests cover the buffer-kind tiers (TCM, SRAM,
HBM) and the inter-SIP topology variants. The plots produced by the
harness are part of its output contract; the harness regenerates them
on demand.

<!-- src: ADR-0044 Context, Decision -->
The GEMM evaluation harness is split into two layers. A heavy
shape-and-variant sweep lives as a manual script: it runs the same
composite-GEMM benchmark across many shapes and operand-staging
variants, harvests the resulting op-log, and writes a JSON summary.
A faster figure-generation layer lives in the test suite and consumes
that JSON to render plots. The split keeps the heavy data
generation explicit and out of the regular test path.

### Bench Module Contract

<!-- src: ADR-0045 Context, Decision -->
Adding a new benchmark requires only dropping a file into the
benchmarks directory. The file registers one or more benchmark
functions through a small decorator that takes a kebab-case name and
a human-readable description. The decorator is the registration
mechanism; there is no separate manifest. Each benchmark function
takes one argument, conventionally named `torch`, which is the
runtime context exposing tensor allocation, kernel launch,
distributed APIs, and process-spawning. The function name is `run` by
convention.

<!-- src: ADR-0045 Decision, Consequences -->
A benchmark must submit at least one operation, or the runner
returns an error. A benchmark instance is single-device by default;
when a benchmark is collective, it uses the distributed-process-spawn
pattern internally: one worker greenlet per rank, with each worker
binding to its rank. Multi-device benchmark patterns outside that
shape are not supported.

### Kernel-side `tl.*` API

<!-- src: ADR-0046 Context, Decision -->
Inside a kernel function, the `tl` argument exposes the kernel-side
API in a shape that mirrors the conventions of established
GPU-kernel languages. Categories: reference handles that name HBM
data without issuing DMA; data movement (load, store) that does
issue DMA; GEMM and math compute (dot, composite, the unary and
binary math operations, reductions); index and scalar helpers
(program identity, range-builders); metadata-only operations like
transpose; and the collective primitives (send, receive,
non-blocking receive). Tensor handles support arithmetic operators
via a thread-local active context so kernel code reads naturally.

<!-- src: ADR-0046 Decision, Consequences -->
The API supports two execution modes. A command-list mode records
operations into a list without consuming simulator time, which is useful
for inspection and lightweight tests. A greenlet-driven mode runs the
kernel as a child greenlet that switches back to the simulator on
each `tl.*` call; the simulator drives the event scheduler and hands
real data back to the kernel as DMA reads complete. The two modes
share the same surface; the kernel does not know which one it is
running under.

### Probe Subcommand

<!-- src: ADR-0049 Context, Decision -->
The probe utility runs three families of traffic patterns through
the engine (host-to-device writes at increasing hop counts,
device-to-host reads at increasing hop counts, and PE-initiated DMA
across the cube mesh) and reports actual latency, the analytical
formula breakdown, effective bandwidth, bottleneck bandwidth, and
utilization. A fixed reference size is used for the summary table;
a separate utilization-versus-size sweep covers a logarithmic range
of transfer sizes. Each case runs in its own engine instance so
cases do not perturb each other.

<!-- src: ADR-0049 Decision, Consequences -->
The probe also checks a small set of invariants automatically:
monotonic latency increase with hop count, device-to-host latency
at least as large as host-to-device for the same hop count, and a
faster best-case path than worst-case for cross-cube PE DMA. Failures
print prominently. The output is meant for human reading; automated
parsing should not depend on column widths or whitespace.

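Two of these invariants can be expressed as a short checker over measured latencies. The function name and the input shape (hop count mapped to latency in nanoseconds) are assumptions, as is treating "monotonic increase" as "never decreases".

```python
from typing import Dict, List


def check_probe_invariants(h2d_ns: Dict[int, float],
                           d2h_ns: Dict[int, float]) -> List[str]:
    """Return human-readable failures; an empty list means all checks pass."""
    failures: List[str] = []
    # Invariant 1: latency must not drop as the hop count grows.
    for label, series in (("host-to-device", h2d_ns),
                          ("device-to-host", d2h_ns)):
        hops = sorted(series)
        for near, far in zip(hops, hops[1:]):
            if series[far] < series[near]:
                failures.append(
                    f"{label}: latency drops from {near} to {far} hops")
    # Invariant 2: a read is at least as slow as a write at equal hop count.
    for hop in sorted(set(h2d_ns) & set(d2h_ns)):
        if d2h_ns[hop] < h2d_ns[hop]:
            failures.append(
                f"{hop} hops: device-to-host faster than host-to-device")
    return failures
```

Returning a list of messages rather than raising on the first violation lets a caller print every failure in a single run.
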
---

This document summarizes 46 architecture decisions captured during
the first half of 2026. It is regenerated mechanically from the
decision corpus; sources are recorded in HTML comments throughout.
@@ -1,237 +1,20 @@
"""Sweep GEMM shapes through kernbench and dump PE_accelerator engine times.

For each shape:
  - run benches.matmul_composite via the same run_bench path the CLI uses
  - read result.engine.op_log
  - filter to per-PE engines: pe_dma, pe_fetch_store, pe_gemm, pe_math
  - record sum-of-durations (engine occupancy) AND wall-clock active interval
Thin wrapper: the sweep logic now lives in
``kernbench.benches.milestone_1h_gemm`` (the single home, ADR-0054, also the
``milestone-1h-gemm`` bench). This script remains the manual entry point for
regenerating ``docs/diagrams/gemm_sweep.json`` on demand and honors the same
``SWEEP_SHAPES`` / ``SWEEP_TOPOLOGY`` env overrides.

Output: docs/diagrams/gemm_sweep.json
    python scripts/gemm_sweep.py
"""
from __future__ import annotations

import json
import os
import sys
import time
from pathlib import Path

# Default sweep covering under-tile, single-tile, multi-tile, and asymmetric regimes.
# Each entry is either a single integer (square M=K=N=S) or "MxKxN".
# Override via env: SWEEP_SHAPES="16,32,16x2048x16,..."
DEFAULT_SHAPES = [
    "32x32x32",    # 1 tile, K=32 < TILE_K=64 → under-tile in K
    "32x64x32",    # 1 tile, exact single-tile fit
    "32x128x32",   # 2 tiles, aligned
    "32x128x128",  # 8 tiles, aligned
    "32x3072x32",  # 48 tiles, all K-axis (tall-skinny)
    "8x128x128",   # 8 tiles, but M=8 < TILE_M=32 → MAC array under-fed
    "128x8x128",   # 16 tiles, but K=8 < TILE_K=64 → MAC array under-fed
    "512",         # 2048 tiles, fully aligned — "well-pipelined" reference
]

# Operand-staging variants exercised per shape.
VARIANTS = ["ref_ref", "load_ref", "load_load"]

# Engines whose timings we collect (component_id suffix match).
ENGINES = ["pe_dma", "pe_fetch_store", "pe_gemm", "pe_math"]

# Per-stage breakdown labels (StageType enum names from pe_types.py).
STAGES = ["DMA_READ", "DMA_WRITE", "FETCH", "STORE", "GEMM", "MATH"]

# Scheduler tile sizes (mirror of PeSchedulerComponent.TILE_M/K/N).
TILE_M, TILE_K, TILE_N = 32, 64, 32

OUT_PATH = Path(__file__).parent.parent / "docs" / "diagrams" / "gemm_sweep.json"


def _engine_wall_ns(records, suffix: str) -> float:
    """Wall-clock interval the engine was active (union of overlapping ops)."""
    intervals = [(r.t_start, r.t_end) for r in records
                 if r.component_id.endswith("." + suffix)]
    if not intervals:
        return 0.0
    intervals.sort()
    merged_end = intervals[0][1]
    merged_start = intervals[0][0]
    total = 0.0
    for s, e in intervals[1:]:
        if s <= merged_end:
            merged_end = max(merged_end, e)
        else:
            total += merged_end - merged_start
            merged_start, merged_end = s, e
    total += merged_end - merged_start
    return total


def _engine_occupancy_ns(records, suffix: str) -> float:
    return sum(r.t_end - r.t_start for r in records
               if r.component_id.endswith("." + suffix))


def _engine_count(records, suffix: str) -> int:
    return sum(1 for r in records if r.component_id.endswith("." + suffix))


def _stage_occupancy_ns(records, stage_type: str) -> float:
    """Sum t_end - t_start over op_log records whose params.stage_type matches.

    Requires op_log records produced post the TileToken stage_type capture
    (sim_engine/op_log.py).
    """
    return sum(
        r.t_end - r.t_start
        for r in records
        if r.params.get("stage_type") == stage_type
    )


def _stage_wall_ns(records, stage_type: str) -> float:
    """Interval-union wall-clock for records whose stage_type matches."""
    intervals = sorted(
        (r.t_start, r.t_end) for r in records
        if r.params.get("stage_type") == stage_type
    )
    if not intervals:
        return 0.0
    total = 0.0
    cs, ce = intervals[0]
    for s, e in intervals[1:]:
        if s <= ce:
            ce = max(ce, e)
        else:
            total += ce - cs
            cs, ce = s, e
    total += ce - cs
    return total


def _stage_count(records, stage_type: str) -> int:
    return sum(1 for r in records if r.params.get("stage_type") == stage_type)


def _run_one(M: int, K: int, N: int, topology: str, variant: str = "ref_ref") -> dict:
    os.environ["MATMUL_M"] = str(M)
    os.environ["MATMUL_K"] = str(K)
    os.environ["MATMUL_N"] = str(N)
    os.environ["MATMUL_VARIANT"] = variant

    # Late imports so env vars are read by matmul_composite at module load.
    # Force re-import to pick up new env values.
    for mod_name in [m for m in list(sys.modules) if m.startswith("kernbench.benches.matmul_composite")]:
        del sys.modules[mod_name]

    from kernbench.benches.registry import resolve as resolve_bench
    from kernbench.runtime_api.bench_runner import run_bench
    from kernbench.runtime_api.types import resolve_device
    from kernbench.sim_engine.engine import GraphEngine
    from kernbench.topology.builder import resolve_topology

    topo = resolve_topology(topology)
    bench = resolve_bench("matmul-composite").run
    device = resolve_device(None)

    t0 = time.time()
    result = run_bench(
        topology=topo, bench_fn=bench, device=device,
        engine_factory=lambda t, d: GraphEngine(
            getattr(t, "topology_obj", t), enable_data=True,
        ),
    )
    wall = time.time() - t0

    op_log = result.engine.op_log
    if not result.completion.ok:
        raise RuntimeError(f"bench failed at M={M},K={K},N={N}: {result.completion}")

    # Bytes touched at f16 (2 B): full A + full B + full out (each operand
    # streamed once through HBM by the composite plan).
    bytes_total = (M * K + K * N + M * N) * 2
    row = {
        "M": M, "K": K, "N": N,
        "variant": variant,
        "flops": 2 * M * K * N,
        "bytes_hbm": bytes_total,
        "arith_intensity": (2 * M * K * N) / bytes_total,  # flops/byte
        "tile_count_expected": _ceil(M, TILE_M) * _ceil(N, TILE_N) * _ceil(K, TILE_K),
        "sim_wall_clock_s": round(wall, 3),
        "engines": {},
    }
    for eng in ENGINES:
        row["engines"][eng] = {
            "occupancy_ns": _engine_occupancy_ns(op_log, eng),
            "wall_ns": _engine_wall_ns(op_log, eng),
            "record_count": _engine_count(op_log, eng),
        }
    row["stages"] = {}
    for stage in STAGES:
        row["stages"][stage] = {
            "occupancy_ns": _stage_occupancy_ns(op_log, stage),
            "wall_ns": _stage_wall_ns(op_log, stage),
            "record_count": _stage_count(op_log, stage),
        }
    # Kernel-window wall-clock = max t_end - min t_start over PE engine records.
    pe_records = [r for r in op_log
                  if any(r.component_id.endswith("." + e) for e in ENGINES)]
    if pe_records:
        row["pe_window_ns"] = max(r.t_end for r in pe_records) \
            - min(r.t_start for r in pe_records)
    else:
        row["pe_window_ns"] = 0.0
    stage_records = [r for r in op_log
                     if r.params.get("stage_type") in STAGES]
    if stage_records:
        row["composite_window_ns"] = max(r.t_end for r in stage_records) \
            - min(r.t_start for r in stage_records)
    else:
        row["composite_window_ns"] = 0.0
    return row


def _ceil(a: int, b: int) -> int:
    return (a + b - 1) // b
from kernbench.benches.milestone_1h_gemm import run_sweep


def main() -> int:
    shapes_env = os.environ.get("SWEEP_SHAPES")
    raw = (shapes_env.split(",") if shapes_env else DEFAULT_SHAPES)
    shapes: list[tuple[int, int, int]] = []
    for s in raw:
        s = s.strip()
        if not s:
            continue
        if "x" in s.lower():
            parts = s.lower().split("x")
            shapes.append((int(parts[0]), int(parts[1]), int(parts[2])))
        else:
            v = int(s)
            shapes.append((v, v, v))
    topology = os.environ.get("SWEEP_TOPOLOGY", "topology.yaml")

    rows = []
    for M, K, N in shapes:
        for variant in VARIANTS:
            print(f"[sweep] M={M} K={K} N={N} variant={variant} ...", flush=True)
            row = _run_one(M, K, N, topology, variant=variant)
            rows.append(row)
            eng_dma = row["engines"]["pe_dma"]
            eng_gem = row["engines"]["pe_gemm"]
            print(f" tiles={row['tile_count_expected']:>6} "
                  f"pe_window={row['pe_window_ns']:8.1f}ns "
                  f"dma_occ={eng_dma['occupancy_ns']:9.1f} "
                  f"gemm_occ={eng_gem['occupancy_ns']:8.1f} "
                  f"(sim {row['sim_wall_clock_s']:.1f}s)")

    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    OUT_PATH.write_text(json.dumps({
        "tile_sizes": {"M": TILE_M, "K": TILE_K, "N": TILE_N},
        "engines": ENGINES,
        "stages": STAGES,
        "variants": VARIANTS,
        "rows": rows,
    }, indent=2))
    print(f"\n[sweep] wrote {OUT_PATH}")
    run_sweep()
    return 0


|
[Binary image file added, 38 KiB]
[Binary image file added, 36 KiB]
@@ -0,0 +1,13 @@
|
||||
buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
|
||||
hbm,torus_2d,6,128,256,2120.040000000012
|
||||
hbm,torus_2d,6,1024,2048,2717.2783333333473
|
||||
hbm,torus_2d,6,8192,16384,7315.184999999989
|
||||
hbm,torus_2d,6,32768,65536,23081.26500000037
|
||||
sram,torus_2d,6,128,256,2060.040000000012
|
||||
sram,torus_2d,6,1024,2048,2909.2783333333473
|
||||
sram,torus_2d,6,8192,16384,9523.184999999869
|
||||
sram,torus_2d,6,32768,65536,32201.265000000385
|
||||
tcm,torus_2d,6,128,256,1964.040000000012
|
||||
tcm,torus_2d,6,1024,2048,2477.2783333333473
|
||||
tcm,torus_2d,6,8192,16384,6403.185000000109
|
||||
tcm,torus_2d,6,32768,65536,19865.265000000378
|
||||
|
|
[Binary image file added, 75 KiB]
[Binary image file added, 37 KiB]
[Binary image file added, 86 KiB]
@@ -0,0 +1,37 @@
|
||||
algorithm,sip_topology,n_sips,n_elem,bytes_per_pe,bytes_per_sip,latency_ns
|
||||
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,8,16,256,2666.552500000015
|
||||
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,32,64,1024,2747.7400000000152
|
||||
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,64,128,2048,2855.990000000018
|
||||
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,128,256,4096,3072.490000000019
|
||||
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,512,1024,16384,3337.1133333333582
|
||||
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3708.0333333333692
|
||||
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,4449.873333333393
|
||||
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,5933.020000000124
|
||||
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,8900.379999999863
|
||||
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,14835.099999999224
|
||||
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,26704.540000000765
|
||||
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,49152,98304,1572864,38573.97999999701
|
||||
lrab_hierarchical_allreduce,ring_1d,6,8,16,256,2365.255833333347
|
||||
lrab_hierarchical_allreduce,ring_1d,6,32,64,1024,2436.9433333333473
|
||||
lrab_hierarchical_allreduce,ring_1d,6,64,128,2048,2532.526666666683
|
||||
lrab_hierarchical_allreduce,ring_1d,6,128,256,4096,2723.693333333349
|
||||
lrab_hierarchical_allreduce,ring_1d,6,512,1024,16384,3048.635000000021
|
||||
lrab_hierarchical_allreduce,ring_1d,6,1024,2048,32768,3393.4016666666957
|
||||
lrab_hierarchical_allreduce,ring_1d,6,2048,4096,65536,4082.401666666714
|
||||
lrab_hierarchical_allreduce,ring_1d,6,4096,8192,131072,5458.80166666677
|
||||
lrab_hierarchical_allreduce,ring_1d,6,8192,16384,262144,8216.934999999943
|
||||
lrab_hierarchical_allreduce,ring_1d,6,16384,32768,524288,13733.201666665835
|
||||
lrab_hierarchical_allreduce,ring_1d,6,32768,65536,1048576,24765.73500000064
|
||||
lrab_hierarchical_allreduce,ring_1d,6,49152,98304,1572864,35798.268333331536
|
||||
lrab_hierarchical_allreduce,torus_2d,6,8,16,256,1700.6025000000095
|
||||
lrab_hierarchical_allreduce,torus_2d,6,32,64,1024,1753.2900000000102
|
||||
lrab_hierarchical_allreduce,torus_2d,6,64,128,2048,1823.540000000012
|
||||
lrab_hierarchical_allreduce,torus_2d,6,128,256,4096,1964.040000000012
|
||||
lrab_hierarchical_allreduce,torus_2d,6,512,1024,16384,2196.8183333333463
|
||||
lrab_hierarchical_allreduce,torus_2d,6,1024,2048,32768,2477.2783333333473
|
||||
lrab_hierarchical_allreduce,torus_2d,6,2048,4096,65536,3038.1983333333583
|
||||
lrab_hierarchical_allreduce,torus_2d,6,4096,8192,131072,4159.5050000000665
|
||||
lrab_hierarchical_allreduce,torus_2d,6,8192,16384,262144,6403.185000000109
|
||||
lrab_hierarchical_allreduce,torus_2d,6,16384,32768,524288,10890.5449999995
|
||||
lrab_hierarchical_allreduce,torus_2d,6,32768,65536,1048576,19865.265000000378
|
||||
lrab_hierarchical_allreduce,torus_2d,6,49152,98304,1572864,28839.98500000059
|
||||
|
|
[Binary image file added, 194 KiB]
[Binary image file added, 40 KiB]
[Binary image file added, 45 KiB]
[Binary image file added, 45 KiB]
@@ -0,0 +1,180 @@
|
||||
"""Mesh-native bidirectional Ring-K/V attention kernel — prefill (ADR-0059 Proposed).
|
||||
|
||||
Each rank holds its own Q tile and 1/n_ranks of K, V (sequence-sharded).
|
||||
Over ``n_ranks - 1`` bidirectional steps, K and V propagate both east and
|
||||
west: chunk c_i originating at rank i reaches rank j at step ``|i - j|``.
|
||||
Every rank receives every other rank's chunk **exactly once** and folds it
|
||||
into a running ``(m, ℓ, o)`` via the online-softmax recurrence. After all
|
||||
steps each rank holds the final attention output for its own Q tokens —
|
||||
no cross-rank merge is required.
|
||||
|
||||
Supersedes ADR-0055's closed-ring ``_attention_ring_kv.py``. Both modules
|
||||
stay on disk during the transition; this one runs on the hardware's
|
||||
actual open-mesh wiring (no closed-ring SFR install required).
|
||||
|
||||
Imported by ``milestone_gqa_llama70b`` (after the bench's Phase 2 switches
|
||||
its imports) and invoked through ``torch.launch(...)`` — not through
|
||||
``dist.all_reduce(...)``. See ADR-0055 Context for why this kernel is not
|
||||
backend-dispatched via ADR-0050's algorithm-module contract.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
from kernbench.common.pe_commands import TensorHandle
|
||||
|
||||
|
||||
def _view(handle: TensorHandle, new_shape: tuple[int, ...]) -> TensorHandle:
|
||||
"""Reshape — metadata only, no command emitted (cf. ``tl.trans``)."""
|
||||
return TensorHandle(
|
||||
id=handle.id,
|
||||
addr=handle.addr,
|
||||
shape=new_shape,
|
||||
dtype=handle.dtype,
|
||||
nbytes=handle.nbytes,
|
||||
data=handle.data,
|
||||
space=handle.space,
|
||||
pinned=handle.pinned,
|
||||
)
|
||||
|
||||
|
||||
def _partial_attention(
|
||||
Q: TensorHandle,
|
||||
K: TensorHandle,
|
||||
V: TensorHandle,
|
||||
S_q: int,
|
||||
S_kv_per_rank: int,
|
||||
h_q: int,
|
||||
d_head: int,
|
||||
tl,
|
||||
) -> tuple[TensorHandle, TensorHandle, TensorHandle]:
|
||||
"""One pass of partial attention against (K, V).
|
||||
|
||||
Emits, in order: GEMM(Q·K^T), row max, softmax, subtract, exp, row sum, GEMM(P·V).
|
||||
Returns the running-statistics triplet ``(m, ℓ, O_partial)`` for the
|
||||
online-softmax mlo merge.
|
||||
"""
|
||||
K_2d_T = _view(K, (h_q * d_head, S_kv_per_rank))
|
||||
V_2d = _view(V, (S_kv_per_rank, h_q * d_head))
|
||||
|
||||
scores = tl.dot(Q, K_2d_T)
|
||||
m = tl.max(scores, axis=-1)
|
||||
P = tl.softmax(scores, axis=-1)
|
||||
scores_centered = scores - m
|
||||
exp_scores = tl.exp(scores_centered)
|
||||
ell = tl.sum(exp_scores, axis=-1)
|
||||
O_partial = tl.dot(P, V_2d)
|
||||
return m, ell, O_partial
|
||||
|
||||
|
||||
def attention_mesh_kv_kernel(
|
||||
q_ptr: int,
|
||||
k_ptr: int,
|
||||
v_ptr: int,
|
||||
o_ptr: int,
|
||||
S_q: int,
|
||||
S_kv_per_rank: int,
|
||||
h_q: int,
|
||||
h_kv: int,
|
||||
d_head: int,
|
||||
n_ranks: int,
|
||||
rank_axis: int = 0,
|
||||
*,
|
||||
tl,
|
||||
) -> None:
|
||||
"""Mesh-native bidirectional Ring-K/V attention — see module docstring.
|
||||
|
||||
``rank_axis`` selects which program-id dimension carries the ring rank:
|
||||
0 — single_user_* panels: rank == tl.program_id(axis=0) (PE id in cube).
|
||||
1 — multi_user_* panels: ring is at the cube level. Only PE 0 in each
|
||||
cube participates; the other 7 hold KV replicas but stay silent.
|
||||
"""
|
||||
# For multi_user (rank_axis=1) only PE 0 in each cube runs the ring.
|
||||
if rank_axis != 0 and tl.program_id(axis=0) != 0:
|
||||
return
|
||||
rank = tl.program_id(axis=rank_axis)
|
||||
has_E = rank < n_ranks - 1
|
||||
has_W = rank > 0
|
||||
|
||||
# Q stays put on this rank — loaded once, used in every partial attention.
|
||||
Q = tl.load(q_ptr, shape=(S_q, h_q * d_head), dtype="f16")
|
||||
|
||||
# Local K, V chunk.
|
||||
K = tl.load(k_ptr, shape=(S_kv_per_rank, h_kv, d_head), dtype="f16")
|
||||
V = tl.load(v_ptr, shape=(S_kv_per_rank, h_kv, d_head), dtype="f16")
|
||||
|
||||
# Step 0 (local): partial attention against own K, V — initializes the
|
||||
# running triplet (m, ℓ, o).
|
||||
m, ell, o = _partial_attention(
|
||||
Q, K, V, S_q, S_kv_per_rank, h_q, d_head, tl,
|
||||
)
|
||||
|
||||
# Seed bidirectional waves with own chunk (step-1 send).
|
||||
to_send_east_K: TensorHandle | None = K
|
||||
to_send_east_V: TensorHandle | None = V
|
||||
to_send_west_K: TensorHandle | None = K
|
||||
to_send_west_V: TensorHandle | None = V
|
||||
|
||||
# Bidirectional fan-out: n_ranks - 1 steps. By step k, the wave from
|
||||
# rank i has reached rank (i ± k). After n_ranks - 1 steps, every rank
|
||||
# has merged every other rank's chunk exactly once (ADR-0059 D3).
|
||||
for step in range(1, n_ranks):
|
||||
# Send the eastbound wave we currently hold (own at step 1; forwarded
|
||||
# at later steps). ``None`` means we have no wave to forward this
|
||||
# direction this step (edge rank, or the wave already passed by).
|
||||
if has_E and to_send_east_K is not None:
|
||||
tl.send(dir="E", src=to_send_east_K)
|
||||
tl.send(dir="E", src=to_send_east_V)
|
||||
if has_W and to_send_west_K is not None:
|
||||
tl.send(dir="W", src=to_send_west_K)
|
||||
tl.send(dir="W", src=to_send_west_V)
|
||||
|
||||
# Receive eastbound wave from W (carries chunk c_{rank - step}).
|
||||
K_from_W: TensorHandle | None = None
|
||||
V_from_W: TensorHandle | None = None
|
||||
if has_W and (rank - step) >= 0:
|
||||
K_from_W = tl.recv(
|
||||
dir="W", shape=(S_kv_per_rank, h_kv, d_head), dtype="f16",
|
||||
)
|
||||
V_from_W = tl.recv(
|
||||
dir="W", shape=(S_kv_per_rank, h_kv, d_head), dtype="f16",
|
||||
)
|
||||
m_new, ell_new, o_new = _partial_attention(
|
||||
Q, K_from_W, V_from_W, S_q, S_kv_per_rank, h_q, d_head, tl,
|
||||
)
|
||||
m_combined = tl.maximum(m, m_new)
|
||||
scale_old = tl.exp(m - m_combined)
|
||||
scale_new = tl.exp(m_new - m_combined)
|
||||
ell = ell * scale_old + ell_new * scale_new
|
||||
o = o * scale_old + o_new * scale_new
|
||||
m = m_combined
|
||||
|
||||
# Receive westbound wave from E (carries chunk c_{rank + step}).
|
||||
K_from_E: TensorHandle | None = None
|
||||
V_from_E: TensorHandle | None = None
|
||||
if has_E and (rank + step) < n_ranks:
|
||||
K_from_E = tl.recv(
|
||||
dir="E", shape=(S_kv_per_rank, h_kv, d_head), dtype="f16",
|
||||
)
|
||||
V_from_E = tl.recv(
|
||||
dir="E", shape=(S_kv_per_rank, h_kv, d_head), dtype="f16",
|
||||
)
|
||||
m_new, ell_new, o_new = _partial_attention(
|
||||
Q, K_from_E, V_from_E, S_q, S_kv_per_rank, h_q, d_head, tl,
|
||||
)
|
||||
m_combined = tl.maximum(m, m_new)
|
||||
scale_old = tl.exp(m - m_combined)
|
||||
scale_new = tl.exp(m_new - m_combined)
|
||||
ell = ell * scale_old + ell_new * scale_new
|
||||
o = o * scale_old + o_new * scale_new
|
||||
m = m_combined
|
||||
|
||||
# Forward what we received for next step. ``None`` propagates: if no
|
||||
# chunk arrived this step (out-of-bounds wave origin), there is
|
||||
# nothing to forward next step in that direction.
|
||||
to_send_east_K = K_from_W
|
||||
to_send_east_V = V_from_W
|
||||
to_send_west_K = K_from_E
|
||||
to_send_west_V = V_from_E
|
||||
|
||||
# Final normalize: O := o / ℓ.
|
||||
O_final = o / ell
|
||||
tl.store(o_ptr, O_final)
|
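The delivery schedule described in the module docstring (chunk c_i reaches rank j at step |i - j|, and every rank receives every other chunk exactly once) can be checked with a small host-side simulation. This is only a sketch of the send/recv bookkeeping on an open line of ranks: `simulate_fanout` is a hypothetical helper, not part of kernbench, and it does not call the `tl` API or model any timing.

```python
from __future__ import annotations


def simulate_fanout(n_ranks: int) -> dict[int, dict[int, int]]:
    """Return arrivals[rank][origin] = step at which origin's chunk arrived."""
    # Each rank seeds both waves with its own chunk id (the step-1 send).
    east = list(range(n_ranks))  # chunk id each rank sends east, or None
    west = list(range(n_ranks))  # chunk id each rank sends west, or None
    arrivals = {r: {} for r in range(n_ranks)}

    for step in range(1, n_ranks):
        from_w = [None] * n_ranks  # what each rank receives from its west side
        from_e = [None] * n_ranks  # what each rank receives from its east side
        for r in range(n_ranks):
            has_e = r < n_ranks - 1
            has_w = r > 0
            if has_e and east[r] is not None:
                from_w[r + 1] = east[r]
            if has_w and west[r] is not None:
                from_e[r - 1] = west[r]
        for r in range(n_ranks):
            for chunk in (from_w[r], from_e[r]):
                if chunk is not None:
                    assert chunk not in arrivals[r], "duplicate delivery"
                    arrivals[r][chunk] = step
        # Forward what was received; None propagates at the mesh edges.
        east, west = from_w, from_e
    return arrivals


arrivals = simulate_fanout(6)
```

In this model a rank stops forwarding in a direction as soon as nothing arrived from the opposite side, which mirrors the `None` propagation in the kernel loop.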
||||
@@ -0,0 +1,151 @@
|
||||
"""Mesh-native bidirectional AllReduce-mlo attention — decode (ADR-0059 Proposed).
|
||||
|
||||
Every rank holds the full Q (replicated, small at ``S_q=1``) and 1/n_ranks
|
||||
of KV (sequence-sharded). Each rank computes its partial attention
|
||||
against own KV in ONE shot, then runs a bidirectional fan-out of the
|
||||
``(m, ℓ, o)`` triplet: the triplet originating at rank i reaches rank j at
|
||||
step ``|i - j|``. Every rank merges every other rank's triplet exactly
|
||||
once over ``n_ranks - 1`` steps, ending with the final answer replicated
|
||||
on every rank.
|
||||
|
||||
Supersedes ADR-0056's closed-ring ``_attention_allreduce_mlo.py``. Both
|
||||
modules stay on disk during the transition; this one runs on the
|
||||
hardware's actual open-mesh wiring (no closed-ring SFR install required).
|
||||
|
||||
Imported by ``milestone_gqa_llama70b`` (after the bench's Phase 2 switches
|
||||
its imports) and invoked through ``torch.launch(...)`` — not through
|
||||
``dist.all_reduce(...)``. See ADR-0056 Context for why this kernel is not
|
||||
backend-dispatched via ADR-0050's algorithm-module contract.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
from kernbench.common.pe_commands import TensorHandle
|
||||
|
||||
|
||||
def _view(handle: TensorHandle, new_shape: tuple[int, ...]) -> TensorHandle:
|
||||
"""Reshape — metadata only, no command emitted (cf. ``tl.trans``)."""
|
||||
return TensorHandle(
|
||||
id=handle.id,
|
||||
addr=handle.addr,
|
||||
shape=new_shape,
|
||||
dtype=handle.dtype,
|
||||
nbytes=handle.nbytes,
|
||||
data=handle.data,
|
||||
space=handle.space,
|
||||
pinned=handle.pinned,
|
||||
)
|
||||
|
||||
|
||||
def attention_mesh_mlo_kernel(
|
||||
q_ptr: int,
|
||||
k_ptr: int,
|
||||
v_ptr: int,
|
||||
o_ptr: int,
|
||||
S_q: int,
|
||||
S_kv_per_rank: int,
|
||||
h_q: int,
|
||||
h_kv: int,
|
||||
d_head: int,
|
||||
n_ranks: int,
|
||||
rank_axis: int = 0,
|
||||
*,
|
||||
tl,
|
||||
) -> None:
|
||||
"""Mesh-native bidirectional AllReduce-mlo — see module docstring.
|
||||
|
||||
``rank_axis`` selects which program-id dimension carries the ring rank:
|
||||
0 — single_user_* panels: rank == tl.program_id(axis=0) (PE id in cube).
|
||||
1 — multi_user_* panels: ring is at the cube level. Only PE 0 in each
|
||||
cube participates; the other 7 hold KV replicas but stay silent.
|
||||
"""
|
||||
# For multi_user (rank_axis=1) only PE 0 in each cube runs the ring.
|
||||
if rank_axis != 0 and tl.program_id(axis=0) != 0:
|
||||
return
|
||||
rank = tl.program_id(axis=rank_axis)
|
||||
has_E = rank < n_ranks - 1
|
||||
has_W = rank > 0
|
||||
|
||||
# Q is replicated on every rank — loaded once.
|
||||
Q = tl.load(q_ptr, shape=(S_q, h_q * d_head), dtype="f16")
|
||||
|
||||
# Local KV chunk. KV is sequence-sharded and stays put on this rank for
|
||||
# the entire fan-out — distinguishing decode from prefill (ADR-0059 D3)
|
||||
# where KV circulates.
|
||||
K = tl.load(k_ptr, shape=(S_kv_per_rank, h_kv, d_head), dtype="f16")
|
||||
V = tl.load(v_ptr, shape=(S_kv_per_rank, h_kv, d_head), dtype="f16")
|
||||
|
||||
# ── One-shot local partial attention ──────────────────────────
|
||||
K_2d_T = _view(K, (h_q * d_head, S_kv_per_rank))
|
||||
V_2d = _view(V, (S_kv_per_rank, h_q * d_head))
|
||||
scores = tl.dot(Q, K_2d_T)
|
||||
m = tl.max(scores, axis=-1)
|
||||
P = tl.softmax(scores, axis=-1)
|
||||
scores_centered = scores - m
|
||||
exp_scores = tl.exp(scores_centered)
|
||||
ell = tl.sum(exp_scores, axis=-1)
|
||||
o = tl.dot(P, V_2d)
|
||||
|
||||
# Seed bidirectional waves with own triplet (step-1 send).
|
||||
to_send_east_m: TensorHandle | None = m
|
||||
to_send_east_ell: TensorHandle | None = ell
|
||||
to_send_east_o: TensorHandle | None = o
|
||||
to_send_west_m: TensorHandle | None = m
|
||||
to_send_west_ell: TensorHandle | None = ell
|
||||
to_send_west_o: TensorHandle | None = o
|
||||
|
||||
# Bidirectional fan-out of (m, ℓ, o) triplets — n_ranks - 1 steps.
|
||||
for step in range(1, n_ranks):
|
||||
# Send eastbound triplet (own at step 1; forwarded at later steps).
|
||||
if has_E and to_send_east_m is not None:
|
||||
tl.send(dir="E", src=to_send_east_m)
|
||||
tl.send(dir="E", src=to_send_east_ell)
|
||||
tl.send(dir="E", src=to_send_east_o)
|
||||
# Send westbound triplet.
|
||||
if has_W and to_send_west_m is not None:
|
||||
tl.send(dir="W", src=to_send_west_m)
|
||||
tl.send(dir="W", src=to_send_west_ell)
|
||||
tl.send(dir="W", src=to_send_west_o)
|
||||
|
||||
# Receive eastbound triplet from W (originated at rank - step).
|
||||
m_from_W: TensorHandle | None = None
|
||||
ell_from_W: TensorHandle | None = None
|
||||
o_from_W: TensorHandle | None = None
|
||||
if has_W and (rank - step) >= 0:
|
||||
m_from_W = tl.recv(dir="W", shape=m.shape, dtype="f16")
|
||||
ell_from_W = tl.recv(dir="W", shape=ell.shape, dtype="f16")
|
||||
o_from_W = tl.recv(dir="W", shape=o.shape, dtype="f16")
|
||||
m_combined = tl.maximum(m, m_from_W)
|
||||
scale_old = tl.exp(m - m_combined)
|
||||
scale_new = tl.exp(m_from_W - m_combined)
|
||||
ell = ell * scale_old + ell_from_W * scale_new
|
||||
o = o * scale_old + o_from_W * scale_new
|
||||
m = m_combined
|
||||
|
||||
# Receive westbound triplet from E (originated at rank + step).
|
||||
m_from_E: TensorHandle | None = None
|
||||
ell_from_E: TensorHandle | None = None
|
||||
o_from_E: TensorHandle | None = None
|
||||
if has_E and (rank + step) < n_ranks:
|
||||
m_from_E = tl.recv(dir="E", shape=m.shape, dtype="f16")
|
||||
ell_from_E = tl.recv(dir="E", shape=ell.shape, dtype="f16")
|
||||
o_from_E = tl.recv(dir="E", shape=o.shape, dtype="f16")
|
||||
m_combined = tl.maximum(m, m_from_E)
|
||||
scale_old = tl.exp(m - m_combined)
|
||||
scale_new = tl.exp(m_from_E - m_combined)
|
||||
ell = ell * scale_old + ell_from_E * scale_new
|
||||
o = o * scale_old + o_from_E * scale_new
|
||||
m = m_combined
|
||||
|
||||
# Forward the original received triplet (not the merged running state)
|
||||
# so neighbors get the original wave. ``None`` propagates if nothing
|
||||
# arrived this step.
|
||||
to_send_east_m = m_from_W
|
||||
to_send_east_ell = ell_from_W
|
||||
to_send_east_o = o_from_W
|
||||
to_send_west_m = m_from_E
|
||||
to_send_west_ell = ell_from_E
|
||||
to_send_west_o = o_from_E
|
||||
|
||||
# Final normalize: O := o / ℓ.
|
||||
O_final = o / ell
|
||||
tl.store(o_ptr, O_final)
|
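The `(m, ℓ, o)` merge in the fan-out loop follows the online-softmax recurrence. As a minimal host-side sketch (plain Python, one query row, scalar values; `partial` and `merge` are illustrative names, not kernbench APIs), the snippet below checks that merging per-shard triplets reproduces softmax-weighted attention over the concatenated scores. It assumes `o` is the unnormalized accumulator Σ exp(s − m)·v, which is the form for which the final `o / ℓ` division is exact.

```python
import math


def partial(scores, values):
    """Triplet (m, ell, o) for one shard; o is the unnormalized weighted sum."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    ell = sum(weights)
    o = sum(w * v for w, v in zip(weights, values))
    return m, ell, o


def merge(a, b):
    """Combine two triplets by rescaling both to the shared running max."""
    m_a, ell_a, o_a = a
    m_b, ell_b, o_b = b
    m = max(m_a, m_b)
    scale_a = math.exp(m_a - m)
    scale_b = math.exp(m_b - m)
    return m, ell_a * scale_a + ell_b * scale_b, o_a * scale_a + o_b * scale_b


scores = [0.5, -1.0, 2.0, 0.1, 3.0, -0.7]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Three shards of two keys each, merged in arrival order.
state = partial(scores[0:2], values[0:2])
for lo in (2, 4):
    state = merge(state, partial(scores[lo:lo + 2], values[lo:lo + 2]))
merged_output = state[2] / state[1]

# Reference: softmax over all scores at once.
m_all = max(scores)
w_all = [math.exp(s - m_all) for s in scores]
reference_output = sum(w * v for w, v in zip(w_all, values)) / sum(w_all)
```

Because the merge is commutative and associative up to rounding, the order in which the west and east triplets are folded in does not change the result.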
||||
@@ -0,0 +1,568 @@
|
||||
"""milestone-1h-gemm bench: GEMM evaluation harness (sweep + figures).
|
||||
|
||||
Self-contained milestone bench (ADR-0054). Holds the shape×variant sweep
|
||||
and the figure renderers; the ``run(torch)`` entry at the bottom runs the
|
||||
sweep (or reuses the committed JSON when ``MILESTONE_FAST=1``) and writes
|
||||
every figure into ``benches/1H_milestone_output/gemm/``.
|
||||
|
||||
This is the single home for the GEMM eval logic: the figure tests import a
|
||||
thin re-export shim (``tests/gemm/_gemm_plot_helpers.py``), as does the
|
||||
``scripts/gemm_sweep.py`` wrapper.
|
||||
|
||||
The sweep drives ``matmul-composite`` across shapes×variants through the
|
||||
same ``run_bench`` path the CLI uses, harvests ``result.engine.op_log``,
|
||||
and writes the sweep JSON. The renderers read that JSON and emit matplotlib
|
||||
PNGs. No simulation in the renderers — they are fast.
|
||||
|
||||
Chart set (mirrors the GEMM MAC slides in scripts/build_overview_slides.py):
|
||||
- stage breakdown (load_ref operand staging)
|
||||
- MAC utilization — measured (load_ref)
|
||||
- MAC utilization — theoretical vs measured (load_ref)
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
from kernbench.benches.registry import bench
|
||||
from kernbench.policy.placement.dp import DPPolicy
|
||||
|
||||
ROOT = Path(__file__).resolve().parents[3]
|
||||
DEFAULT_SWEEP_JSON = ROOT / "docs" / "diagrams" / "gemm_sweep.json"
|
||||
DEFAULT_PLOTS_DIR = ROOT / "docs" / "diagrams" / "gemm_plots"
|
||||
_OUTPUT_DIR = Path(__file__).resolve().parent / "1H_milestone_output" / "gemm"
|
||||
|
||||
# ── sweep configuration ────────────────────────────────────────────────
|
||||
|
||||
# Default sweep covering under-tile, single-tile, multi-tile, and asymmetric
|
||||
# regimes. Each entry is "MxKxN" or a single int (square M=K=N).
|
||||
# Override via env: SWEEP_SHAPES="16,32,16x2048x16,..."
|
||||
DEFAULT_SHAPES = [
|
||||
"32x32x32", # 1 tile, K=32 < TILE_K=64 → under-tile in K
|
||||
"32x64x32", # 1 tile, exact single-tile fit
|
||||
"32x128x32", # 2 tiles, aligned
|
||||
"32x128x128", # 8 tiles, aligned
|
||||
"32x3072x32", # 48 tiles, all K-axis (tall-skinny)
|
||||
"8x128x128", # 8 tiles, but M=8 < TILE_M=32 → MAC array under-fed
|
||||
"128x8x128", # 16 tiles, but K=8 < TILE_K=64 → MAC array under-fed
|
||||
"512", # 2048 tiles, fully aligned — "well-pipelined" reference
|
||||
]
|
||||
|
||||
# Operand-staging variants exercised per shape.
|
||||
VARIANTS = ["ref_ref", "load_ref", "load_load"]
|
||||
|
||||
# Engines whose timings we collect (component_id suffix match).
|
||||
ENGINES = ["pe_dma", "pe_fetch_store", "pe_gemm", "pe_math"]
|
||||
|
||||
# Per-stage breakdown labels (StageType enum names from pe_types.py).
|
||||
STAGES = ["DMA_READ", "DMA_WRITE", "FETCH", "STORE", "GEMM", "MATH"]
|
||||
|
||||
# Scheduler tile sizes (mirror of PeSchedulerComponent.TILE_M/K/N).
|
||||
TILE_M, TILE_K, TILE_N = 32, 64, 32
|
||||
|
||||
|
||||
def _ceil(a: int, b: int) -> int:
|
||||
return (a + b - 1) // b
|
||||
|
||||
|
||||
def _engine_wall_ns(records, suffix: str) -> float:
|
||||
"""Wall-clock interval the engine was active (union of overlapping ops)."""
|
||||
intervals = [(r.t_start, r.t_end) for r in records
|
||||
if r.component_id.endswith("." + suffix)]
|
||||
if not intervals:
|
||||
return 0.0
|
||||
intervals.sort()
|
||||
merged_end = intervals[0][1]
|
||||
merged_start = intervals[0][0]
|
||||
total = 0.0
|
||||
for s, e in intervals[1:]:
|
||||
if s <= merged_end:
|
||||
merged_end = max(merged_end, e)
|
||||
else:
|
||||
total += merged_end - merged_start
|
||||
merged_start, merged_end = s, e
|
||||
total += merged_end - merged_start
|
||||
return total
|
||||
|
||||
|
||||
def _engine_occupancy_ns(records, suffix: str) -> float:
|
||||
return sum(r.t_end - r.t_start for r in records
|
||||
if r.component_id.endswith("." + suffix))
|
||||
|
||||
|
||||
def _engine_count(records, suffix: str) -> int:
|
||||
return sum(1 for r in records if r.component_id.endswith("." + suffix))
|
||||
|
||||
|
||||
def _stage_occupancy_ns(records, stage_type: str) -> float:
|
||||
return sum(
|
||||
r.t_end - r.t_start
|
||||
for r in records
|
||||
if r.params.get("stage_type") == stage_type
|
||||
)
|
||||
|
||||
|
||||
def _stage_wall_ns(records, stage_type: str) -> float:
|
||||
"""Interval-union wall-clock for records whose stage_type matches."""
|
||||
intervals = sorted(
|
||||
(r.t_start, r.t_end) for r in records
|
||||
if r.params.get("stage_type") == stage_type
|
||||
)
|
||||
if not intervals:
|
||||
return 0.0
|
||||
total = 0.0
|
||||
cs, ce = intervals[0]
|
||||
for s, e in intervals[1:]:
|
||||
if s <= ce:
|
||||
ce = max(ce, e)
|
||||
else:
|
||||
total += ce - cs
|
||||
cs, ce = s, e
|
||||
total += ce - cs
|
||||
return total
|
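The difference between `occupancy_ns` (sum of record durations) and `wall_ns` (length of the interval union) only shows up when records overlap. The standalone sketch below uses plain `(t_start, t_end)` tuples rather than op_log records; the function names are illustrative and the numbers are made up for the example.

```python
def occupancy_ns(intervals):
    """Sum of durations; overlapping time is counted once per record."""
    return sum(end - start for start, end in intervals)


def wall_ns(intervals):
    """Length of the union of intervals; overlapping time is counted once."""
    if not intervals:
        return 0.0
    ordered = sorted(intervals)
    cur_start, cur_end = ordered[0]
    total = 0.0
    for start, end in ordered[1:]:
        if start <= cur_end:
            # Overlaps (or touches) the current run: extend it.
            cur_end = max(cur_end, end)
        else:
            # Gap: close the current run and start a new one.
            total += cur_end - cur_start
            cur_start, cur_end = start, end
    return total + (cur_end - cur_start)


records = [(0.0, 10.0), (5.0, 12.0), (20.0, 25.0)]
occ = occupancy_ns(records)  # 10 + 7 + 5
wall = wall_ns(records)      # [0, 12] plus [20, 25]
```

With these inputs the first two records overlap by 5 ns, so occupancy exceeds wall-clock by exactly that amount.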
||||
|
||||
|
||||
def _stage_count(records, stage_type: str) -> int:
|
||||
return sum(1 for r in records if r.params.get("stage_type") == stage_type)
|
||||
|
||||
|
||||
def _run_one(M: int, K: int, N: int, topology: str, variant: str = "ref_ref") -> dict:
|
||||
os.environ["MATMUL_M"] = str(M)
|
||||
os.environ["MATMUL_K"] = str(K)
|
||||
os.environ["MATMUL_N"] = str(N)
|
||||
os.environ["MATMUL_VARIANT"] = variant
|
||||
|
||||
# Late imports so env vars are read by matmul_composite at module load.
|
||||
# Force re-import to pick up new env values.
|
||||
for mod_name in [m for m in list(sys.modules)
|
||||
if m.startswith("kernbench.benches.matmul_composite")]:
|
||||
del sys.modules[mod_name]
|
||||
|
||||
from kernbench.benches.registry import resolve as resolve_bench
|
||||
from kernbench.runtime_api.bench_runner import run_bench
|
||||
from kernbench.runtime_api.types import resolve_device
|
||||
from kernbench.sim_engine.engine import GraphEngine
|
||||
from kernbench.topology.builder import resolve_topology
|
||||
|
||||
topo = resolve_topology(topology)
|
||||
bench_fn = resolve_bench("matmul-composite").run  # avoids shadowing the imported ``bench`` decorator
|
||||
device = resolve_device(None)
|
||||
|
||||
t0 = time.time()
|
||||
result = run_bench(
|
||||
topology=topo, bench_fn=bench_fn, device=device,
|
||||
engine_factory=lambda t, d: GraphEngine(
|
||||
getattr(t, "topology_obj", t), enable_data=True,
|
||||
),
|
||||
)
|
||||
wall = time.time() - t0
|
||||
|
||||
op_log = result.engine.op_log
|
||||
if not result.completion.ok:
|
||||
raise RuntimeError(f"bench failed at M={M},K={K},N={N}: {result.completion}")
|
||||
|
||||
# Bytes touched at f16 (2 B): full A + full B + full out (each operand
|
||||
# streamed once through HBM by the composite plan).
|
||||
bytes_total = (M * K + K * N + M * N) * 2
|
||||
row = {
|
||||
"M": M, "K": K, "N": N,
|
||||
"variant": variant,
|
||||
"flops": 2 * M * K * N,
|
||||
"bytes_hbm": bytes_total,
|
||||
"arith_intensity": (2 * M * K * N) / bytes_total, # flops/byte
|
||||
"tile_count_expected": _ceil(M, TILE_M) * _ceil(N, TILE_N) * _ceil(K, TILE_K),
|
||||
"sim_wall_clock_s": round(wall, 3),
|
||||
"engines": {},
|
||||
}
|
||||
for eng in ENGINES:
|
||||
row["engines"][eng] = {
|
||||
"occupancy_ns": _engine_occupancy_ns(op_log, eng),
|
||||
"wall_ns": _engine_wall_ns(op_log, eng),
|
||||
"record_count": _engine_count(op_log, eng),
|
||||
}
|
||||
row["stages"] = {}
|
||||
for stage in STAGES:
|
||||
row["stages"][stage] = {
|
||||
"occupancy_ns": _stage_occupancy_ns(op_log, stage),
|
||||
"wall_ns": _stage_wall_ns(op_log, stage),
|
||||
"record_count": _stage_count(op_log, stage),
|
||||
}
|
||||
# Kernel-window wall-clock = max t_end - min t_start over PE engine records.
|
||||
pe_records = [r for r in op_log
|
||||
if any(r.component_id.endswith("." + e) for e in ENGINES)]
|
||||
if pe_records:
|
||||
row["pe_window_ns"] = max(r.t_end for r in pe_records) \
|
||||
- min(r.t_start for r in pe_records)
|
||||
else:
|
||||
row["pe_window_ns"] = 0.0
|
||||
stage_records = [r for r in op_log
|
||||
if r.params.get("stage_type") in STAGES]
|
||||
if stage_records:
|
||||
row["composite_window_ns"] = max(r.t_end for r in stage_records) \
|
||||
- min(r.t_start for r in stage_records)
|
||||
else:
|
||||
row["composite_window_ns"] = 0.0
|
||||
return row
|
||||
|
||||
|
||||
def _parse_shapes(raw) -> list[tuple[int, int, int]]:
|
||||
shapes: list[tuple[int, int, int]] = []
|
||||
for s in raw:
|
||||
s = s.strip()
|
||||
if not s:
|
||||
continue
|
||||
if "x" in s.lower():
|
||||
parts = s.lower().split("x")
|
||||
shapes.append((int(parts[0]), int(parts[1]), int(parts[2])))
|
||||
else:
|
||||
v = int(s)
|
||||
shapes.append((v, v, v))
|
||||
return shapes
|
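To illustrate how a `SWEEP_SHAPES` entry maps to `tile_count_expected`, here is a standalone sketch that re-implements the parsing rule and the ceil-division tile count with the tile sizes quoted above (32x64x32). `parse_shape` and `tile_count` are illustrative names; the real logic lives in `_parse_shapes` and `_run_one`. Unlike the original, this sketch raises `ValueError` when an "x"-separated entry does not have exactly three parts.

```python
from __future__ import annotations

TILE_M, TILE_K, TILE_N = 32, 64, 32


def parse_shape(entry: str) -> tuple[int, int, int]:
    """Parse "MxKxN" or a single int (square M=K=N) into (M, K, N)."""
    text = entry.strip().lower()
    if "x" in text:
        m, k, n = (int(part) for part in text.split("x"))
        return m, k, n
    v = int(text)
    return v, v, v


def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b


def tile_count(m: int, k: int, n: int) -> int:
    """Number of (TILE_M, TILE_K, TILE_N) tiles needed to cover the GEMM."""
    return ceil_div(m, TILE_M) * ceil_div(n, TILE_N) * ceil_div(k, TILE_K)


counts = {
    entry: tile_count(*parse_shape(entry))
    for entry in ["32x32x32", "32x3072x32", "128x8x128", "512"]
}
```

The resulting counts agree with the per-entry comments in `DEFAULT_SHAPES` (1, 48, 16 and 2048 tiles respectively).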
||||
|
||||
|
||||
def run_sweep(out_json: Path | str = DEFAULT_SWEEP_JSON) -> Path:
|
||||
"""Drive matmul-composite across shapes×variants; write the sweep JSON.
|
||||
|
||||
Honors ``SWEEP_SHAPES`` / ``SWEEP_TOPOLOGY`` env overrides (same as the
|
||||
historical ``scripts/gemm_sweep.py``). Returns the JSON path written.
|
||||
"""
|
||||
shapes_env = os.environ.get("SWEEP_SHAPES")
|
||||
raw = (shapes_env.split(",") if shapes_env else DEFAULT_SHAPES)
|
||||
shapes = _parse_shapes(raw)
|
||||
topology = os.environ.get("SWEEP_TOPOLOGY", "topology.yaml")
|
||||
|
||||
rows = []
|
||||
for M, K, N in shapes:
|
||||
for variant in VARIANTS:
|
||||
print(f"[sweep] M={M} K={K} N={N} variant={variant} ...", flush=True)
|
||||
row = _run_one(M, K, N, topology, variant=variant)
|
||||
rows.append(row)
|
||||
eng_dma = row["engines"]["pe_dma"]
|
||||
eng_gem = row["engines"]["pe_gemm"]
|
||||
print(f" tiles={row['tile_count_expected']:>6} "
|
||||
f"pe_window={row['pe_window_ns']:8.1f}ns "
|
||||
f"dma_occ={eng_dma['occupancy_ns']:9.1f} "
|
||||
f"gemm_occ={eng_gem['occupancy_ns']:8.1f} "
|
||||
f"(sim {row['sim_wall_clock_s']:.1f}s)")
|
||||
|
||||
out_json = Path(out_json)
|
||||
out_json.parent.mkdir(parents=True, exist_ok=True)
|
||||
out_json.write_text(json.dumps({
|
||||
"tile_sizes": {"M": TILE_M, "K": TILE_K, "N": TILE_N},
|
||||
"engines": ENGINES,
|
||||
"stages": STAGES,
|
||||
"variants": VARIANTS,
|
||||
"rows": rows,
|
||||
}, indent=2))
|
||||
print(f"\n[sweep] wrote {out_json}")
|
||||
return out_json
|
||||
|
||||
|
||||
# ── figure rendering ───────────────────────────────────────────────────
|
||||
|
||||
# Shapes excluded from the figures (mirrors build_overview_slides).
|
||||
EXCLUDED_SHAPES = {(512, 512, 512)}
|
||||
|
||||
# Stage bars shown (raw op_log stage_type keys) + display names + colors.
|
||||
STAGE_KEYS = ["DMA_READ", "FETCH", "GEMM", "DMA_WRITE"]
|
||||
STAGE_DISPLAY = {
|
||||
"DMA_READ": "DMA in",
|
||||
"FETCH": "Fetch",
|
||||
"GEMM": "GEMM",
|
||||
"DMA_WRITE": "DMA out",
|
||||
}
|
||||
STAGE_COLORS = {
|
||||
"DMA_READ": "#3B82F6",
|
||||
"FETCH": "#10B981",
|
||||
"GEMM": "#F59E0B",
|
||||
"DMA_WRITE": "#A855F7",
|
||||
}
|
||||
|
||||
# MAC-utilization model constants (mirror build_overview_slides).
|
||||
_HBM_GBS = 256.0
|
||||
_BPE = 2
|
||||
_T_STAGE = 16.0
|
||||
_D_STAGES = 3
|
||||
|
||||
_PLOT_VARIANT = "load_ref"
|
||||
|
||||
|
||||
def _load_sweep_data(sweep_json: Path | str = DEFAULT_SWEEP_JSON) -> dict:
|
||||
sweep_json = Path(sweep_json)
|
||||
if not sweep_json.exists():
|
||||
return {"rows": []}
|
||||
data = json.loads(sweep_json.read_text())
|
||||
data["rows"] = [
|
||||
r for r in data.get("rows", [])
|
||||
if (r["M"], r["K"], r["N"]) not in EXCLUDED_SHAPES
|
||||
]
|
||||
return data
|
||||
|
||||
|
||||
def _shape_label(r: dict) -> str:
|
||||
if r["M"] == r["K"] == r["N"]:
|
||||
return f"M=K=N={r['M']}"
|
||||
return f"M={r['M']} K={r['K']} N={r['N']}"
|
||||
|
||||
|
||||
def _under_tile(M, K, N, tile_M, tile_K, tile_N) -> bool:
|
||||
return M < tile_M or K < tile_K or N < tile_N
|
||||
|
||||
|
||||
def _xtick_labels(shape_labels, tile_counts, flagged) -> list[str]:
|
||||
out = []
|
||||
for lbl, tc, fl in zip(shape_labels, tile_counts, flagged):
|
||||
s = f"{lbl}\n({tc} tiles)"
|
||||
if fl:
|
||||
s += " *"
|
||||
out.append(s)
|
||||
return out
|
||||
|
||||
|
||||
def _grouped_bar_png(
    out_name: str, *, out_dir: Path, title: str, subtitle: str | None,
    shape_labels, tile_counts, flagged, series: dict, colors: dict,
    y_label: str, threshold: float | None = None, footnote: str | None = None,
) -> str:
    """Render one grouped-bar chart to out_dir/out_name; return the path."""
    import matplotlib.pyplot as plt
    import numpy as np

    n_groups = len(shape_labels)
    n_series = max(1, len(series))
    x = np.arange(n_groups)
    width = 0.8 / n_series

    fig, ax = plt.subplots(figsize=(11, 6))
    for i, (name, vals) in enumerate(series.items()):
        offset = (i - (n_series - 1) / 2) * width
        ax.bar(x + offset, vals, width, label=name, color=colors.get(name))

    ax.set_xticks(x)
    ax.set_xticklabels(
        _xtick_labels(shape_labels, tile_counts, flagged), fontsize=8,
    )
    ax.set_ylabel(y_label)
    ax.set_title(title, fontsize=13, fontweight="bold")
    if subtitle:
        ax.text(0.5, 1.01, subtitle, transform=ax.transAxes, ha="center",
                va="bottom", fontsize=8, color="#475569")
    if threshold is not None:
        ax.axhline(threshold, ls="--", color="gray", lw=1.0)
    ax.legend(fontsize=8, loc="upper right")
    ax.grid(True, axis="y", alpha=0.3)

    caption = "* = under-tile shape (M<TILE_M, K<TILE_K, or N<TILE_N)"
    if footnote:
        caption = footnote + "\n" + caption
    fig.text(0.5, 0.01, caption, ha="center", fontsize=7, color="gray",
             wrap=True)

    fig.tight_layout(rect=(0, 0.05, 1, 1))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out = out_dir / out_name
    fig.savefig(out, dpi=120)
    plt.close(fig)
    return str(out)


def emit_stage_breakdown(
    sweep_json: Path | str = DEFAULT_SWEEP_JSON,
    out_dir: Path | str = DEFAULT_PLOTS_DIR,
) -> str | None:
    """Per-stage engine wall-clock per shape (load_ref operand staging)."""
    data = _load_sweep_data(sweep_json)
    rows = [r for r in data["rows"] if r.get("variant") == _PLOT_VARIANT]
    if not rows:
        return None
    tile = data["tile_sizes"]
    shape_labels = [_shape_label(r) for r in rows]
    flagged = [_under_tile(r["M"], r["K"], r["N"], tile["M"], tile["K"], tile["N"])
               for r in rows]
    tile_counts = [r["tile_count_expected"] for r in rows]
    series = {
        STAGE_DISPLAY[s]: [r.get("stages", {}).get(s, {}).get("wall_ns", 0.0)
                           for r in rows]
        for s in STAGE_KEYS
    }
    colors = {STAGE_DISPLAY[s]: STAGE_COLORS[s] for s in STAGE_KEYS}
    return _grouped_bar_png(
        "gemm_stage_breakdown.png", out_dir=Path(out_dir),
        title="GEMM stage breakdown",
        subtitle=(f"Per-stage engine wall-clock (DMA in / Fetch / GEMM / "
                  f"DMA out), {_PLOT_VARIANT} staging. "
                  f"Tile {tile['M']}x{tile['K']}x{tile['N']}."),
        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
        series=series, colors=colors, y_label="ns",
        footnote="Bars = engine wall-clock interval (merged overlaps).",
    )


def emit_mac_utilization_measured(
    sweep_json: Path | str = DEFAULT_SWEEP_JSON,
    out_dir: Path | str = DEFAULT_PLOTS_DIR,
) -> str | None:
    """GEMM util % and useful pipeline-eff % (analytical model, load_ref)."""
    data = _load_sweep_data(sweep_json)
    rows = data["rows"]
    if not rows:
        return None
    tile = data["tile_sizes"]
    TILE_M, TILE_K, TILE_N = tile["M"], tile["K"], tile["N"]
    tile_flops = 2 * TILE_M * TILE_K * TILE_N
    dma_w_per_pair = (TILE_M * TILE_N * _BPE) / _HBM_GBS
    head_ns = (_D_STAGES - 1) * _T_STAGE

    by_shape = {(r["M"], r["K"], r["N"]): r
                for r in rows if r["variant"] == _PLOT_VARIANT}
    shapes = list(by_shape)
    if not shapes:
        return None
    shape_labels = [_shape_label(by_shape[k]) for k in shapes]
    flagged = [_under_tile(*k, TILE_M, TILE_K, TILE_N) for k in shapes]
    tile_counts = [by_shape[k]["tile_count_expected"] for k in shapes]

    gemm_util, useful_eff = [], []
    for k in shapes:
        r = by_shape[k]
        M, K, N = r["M"], r["K"], r["N"]
        useful = 2 * M * K * N
        tiles = r["tile_count_expected"]
        gu = useful / (tile_flops * tiles) * 100
        gemm_util.append(gu)
        m_tiles = (M + TILE_M - 1) // TILE_M
        n_tiles = (N + TILE_N - 1) // TILE_N
        n_mn = m_tiles * n_tiles
        compute_total = tiles * _T_STAGE
        wall = head_ns + tiles * _T_STAGE + max(0, n_mn - 1) * dma_w_per_pair
        ueff = (compute_total * (gu / 100.0) / wall) * 100 if wall > 0 else 0.0
        useful_eff.append(ueff)

    series = {"GEMM util %": gemm_util, "Useful eff %": useful_eff}
    colors = {"GEMM util %": "#10B981", "Useful eff %": "#F59E0B"}
    return _grouped_bar_png(
        "gemm_mac_utilization_measured.png", out_dir=Path(out_dir),
        title="GEMM MAC utilization — load_ref",
        subtitle=("GEMM util = useful FLOPs / (tile FLOPs x tiles); "
                  "Useful eff = GEMM util x ideal pipeline efficiency."),
        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
        series=series, colors=colors, y_label="%", threshold=100.0,
        footnote="Theoretical ideal-pipeline model (not simulator data).",
    )


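The analytical model above can be worked through by hand for one shape. The sketch below re-derives GEMM utilization and useful efficiency with the same formulas; every constant is a hypothetical stand-in (the real `_T_STAGE`, `_D_STAGES`, `_BPE`, `_HBM_GBS` and tile sizes are defined elsewhere and not shown here), and the tile count is assumed to be the product of the per-dimension tile counts, standing in for `tile_count_expected`.

```python
# Hypothetical constants, chosen only to make the arithmetic easy to follow.
T_STAGE_NS = 100.0        # time of one pipeline stage per tile
D_STAGES = 4              # pipeline depth
BPE = 2                   # bytes per element (f16)
HBM_GBS = 256.0           # 1 GB/s == 1 byte/ns, so bytes / GB/s gives ns
TILE_M = TILE_K = TILE_N = 32


def ceil_div(a, b):
    return (a + b - 1) // b


def ideal_pipeline(M, K, N):
    m_tiles, k_tiles, n_tiles = (
        ceil_div(M, TILE_M), ceil_div(K, TILE_K), ceil_div(N, TILE_N)
    )
    tiles = m_tiles * k_tiles * n_tiles          # assumed tile count
    useful_flops = 2 * M * K * N
    tile_flops = 2 * TILE_M * TILE_K * TILE_N
    util = useful_flops / (tile_flops * tiles)   # padding waste shows up here

    head_ns = (D_STAGES - 1) * T_STAGE_NS        # pipeline fill
    dma_w_ns = (TILE_M * TILE_N * BPE) / HBM_GBS # write-back per (m, n) pair
    compute_ns = tiles * T_STAGE_NS
    wall_ns = head_ns + compute_ns + max(0, m_tiles * n_tiles - 1) * dma_w_ns
    eff = compute_ns * util / wall_ns
    return util * 100, wall_ns, eff * 100


util_pct, wall_ns, eff_pct = ideal_pipeline(48, 64, 64)
print(f"util={util_pct:.1f}%  wall={wall_ns:.0f} ns  useful eff={eff_pct:.2f}%")
```

With these assumed numbers, M=48 pads to two 32-row tiles, so a quarter of the tile FLOPs are wasted (75% utilization), and pipeline fill plus write-back stretch the 800 ns of compute to a 1124 ns wall time.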
def emit_mac_utilization_theoretical_vs_measured(
    sweep_json: Path | str = DEFAULT_SWEEP_JSON,
    out_dir: Path | str = DEFAULT_PLOTS_DIR,
) -> str | None:
    """Theoretical vs simulator-measured GEMM util / useful eff (load_ref)."""
    data = _load_sweep_data(sweep_json)
    rows = data["rows"]
    if not rows:
        return None
    tile = data["tile_sizes"]
    TILE_M, TILE_K, TILE_N = tile["M"], tile["K"], tile["N"]
    tile_flops = 2 * TILE_M * TILE_K * TILE_N
    dma_w_per_pair = (TILE_M * TILE_N * _BPE) / _HBM_GBS
    head_ns = (_D_STAGES - 1) * _T_STAGE
    peak_per_ns = tile_flops / _T_STAGE

    by_shape = {(r["M"], r["K"], r["N"]): r
                for r in rows if r["variant"] == _PLOT_VARIANT}
    shapes = list(by_shape)
    if not shapes:
        return None
    shape_labels = [_shape_label(by_shape[k]) for k in shapes]
    flagged = [_under_tile(*k, TILE_M, TILE_K, TILE_N) for k in shapes]
    tile_counts = [by_shape[k]["tile_count_expected"] for k in shapes]

    gu_t, gu_m, eff_t, eff_m = [], [], [], []
    for k in shapes:
        r = by_shape[k]
        M, K, N = r["M"], r["K"], r["N"]
        useful = 2 * M * K * N
        tiles = r["tile_count_expected"]
        gut = useful / (tile_flops * tiles)
        gu_t.append(gut * 100)
        rec = r.get("stages", {}).get("GEMM", {}).get("record_count", 0) or tiles
        gu_m.append((useful / (tile_flops * rec) * 100) if rec else 0.0)
        m_tiles = (M + TILE_M - 1) // TILE_M
        n_tiles = (N + TILE_N - 1) // TILE_N
        n_mn = m_tiles * n_tiles
        compute_total = tiles * _T_STAGE
        wall_t = head_ns + compute_total + max(0, n_mn - 1) * dma_w_per_pair
        eff_t.append((compute_total * gut / wall_t * 100) if wall_t > 0 else 0.0)
        cw = r.get("composite_window_ns", 0.0) or 0.0
        eff_m.append((useful / cw / peak_per_ns * 100) if cw > 0 else 0.0)

    series = {
        "GEMM util % (theoretical)": gu_t,
        "GEMM util % (measured)": gu_m,
        "Theoretical eff %": eff_t,
        "Measured eff %": eff_m,
    }
    colors = {
        "GEMM util % (theoretical)": "#10B981",
        "GEMM util % (measured)": "#6EE7B7",
        "Theoretical eff %": "#F59E0B",
        "Measured eff %": "#3B82F6",
    }
    return _grouped_bar_png(
        "gemm_mac_utilization_theoretical_vs_measured.png", out_dir=Path(out_dir),
        title="GEMM MAC utilization — theoretical vs measured (load_ref)",
        subtitle=("theoretical model vs simulator op_log; agreement "
                  "validates the analytical pipeline model."),
        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
        series=series, colors=colors, y_label="%", threshold=100.0,
    )


def emit_all_gemm_plots(
    sweep_json: Path | str = DEFAULT_SWEEP_JSON,
    out_dir: Path | str = DEFAULT_PLOTS_DIR,
) -> list[str]:
    """Render every GEMM figure that has data; return the paths written."""
    paths = []
    for fn in (emit_stage_breakdown,
               emit_mac_utilization_measured,
               emit_mac_utilization_theoretical_vs_measured):
        p = fn(sweep_json, out_dir)
        if p:
            paths.append(p)
    return paths


# ── bench entry ────────────────────────────────────────────────────────


@bench(
    name="milestone-1h-gemm",
    description="1H milestone: regenerate all GEMM results + figures.",
)
def run(torch) -> None:
    """Run the GEMM sweep (or reuse committed JSON) and render every figure.

    ``MILESTONE_FAST=1`` reuses the committed ``DEFAULT_SWEEP_JSON`` (seconds);
    otherwise the full sweep runs into ``out_dir/gemm_sweep.json`` (minutes).
    The sweep drives its own engines, so a sentinel tensor is submitted at the
    end to satisfy the run_bench contract (ADR-0045 D4).
    """
    _OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    fast = bool(os.environ.get("MILESTONE_FAST"))
    if fast:
        sweep_json = DEFAULT_SWEEP_JSON
    else:
        sweep_json = run_sweep(out_json=_OUTPUT_DIR / "gemm_sweep.json")
    paths = emit_all_gemm_plots(sweep_json=sweep_json, out_dir=_OUTPUT_DIR)
    print(f" milestone-1h-gemm: {len(paths)} figures -> {_OUTPUT_DIR} "
          f"(fast={fast})")

    torch.zeros(
        (1, 1), dtype="f16",
        dp=DPPolicy(cube="row_wise", pe="replicate", num_cubes=1, num_pes=1),
        name="milestone_gemm_sentinel",
    )
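One detail of the fast-path switch worth keeping in mind: `bool(os.environ.get("MILESTONE_FAST"))` is true for any non-empty value, so `MILESTONE_FAST=0` also selects the fast path. The sketch below contrasts that with a stricter parser; `env_flag` is a hypothetical helper written for this illustration and is not part of the bench.

```python
import os


def env_flag(name, environ=None):
    """Hypothetical stricter parser: only explicit 'on' spellings count."""
    env = os.environ if environ is None else environ
    return env.get(name, "").strip().lower() in {"1", "true", "yes", "on"}


# Use a plain dict instead of mutating the real process environment.
fake_env = {"MILESTONE_FAST": "0"}

loose = bool(fake_env.get("MILESTONE_FAST"))    # non-empty string is truthy
strict = env_flag("MILESTONE_FAST", fake_env)   # "0" is not an 'on' spelling

print(f"loose={loose} strict={strict}")
```

Whether the looser behavior is intended is a design choice for the bench; the docstring only documents the `MILESTONE_FAST=1` case.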
@@ -609,6 +609,7 @@ class RuntimeContext:
         kernel_fn: Any,
         *args: Any,
         _defer_wait: bool = False,
+        _auto_dim_remap: bool = True,
         **kwargs: Any,
     ) -> RequestHandle:
         """Register and launch a kernel (like a fused torch op).
@@ -700,21 +701,36 @@ class RuntimeContext:
                 return t.shape
             # ADR-0026: DPPolicy no longer crosses SIP boundaries; cube + PE
             # are the only axes that shrink the local shape.
+            # Mirror the tensor allocator's precedence (context.py L471-484):
+            # DPPolicy.num_cubes overrides the topology's cube count when set.
+            # Without this, multi_user panels at validation scale
+            # (DPPolicy.num_cubes=4) get sharded as if the topology's full
+            # cube count (16) applied — see test_launch_dim_translation.py.
             if dp.cube != "replicate":
+                eff_num_cubes = (
+                    dp.num_cubes if dp.num_cubes is not None else self._num_cubes
+                )
                 if dp.cube == "column_wise":
-                    K = K // self._num_cubes
+                    K = K // eff_num_cubes
                 elif dp.cube == "row_wise":
-                    M = M // self._num_cubes
+                    M = M // eff_num_cubes
             if len(t.shape) < 2:
                 return (K,)
             return (M, K)

+        # Auto-dim-remap (opt-out via _auto_dim_remap=False). Legacy
+        # kernels (e.g. va_offset bench) pass global dims as scalars and
+        # rely on launch to rewrite them to local. Mesh attention kernels
+        # already receive cube-local dims (S_kv_per_rank, d_head, …) and
+        # opt out — the remap would otherwise collide d_head=64 with K's
+        # global M=64 and rewrite d_head. See test_launch_dim_translation.py.
         dim_map: dict[int, int] = {}  # global_dim → local_dim
-        for t in tensor_args:
-            local = _compute_local_shape(t)
-            for g, l in zip(t.shape if len(t.shape) >= 2 else (1, t.shape[0]), local if len(local) >= 2 else (1, local[0])):
-                if g != l:
-                    dim_map[g] = l
+        if _auto_dim_remap:
+            for t in tensor_args:
+                local = _compute_local_shape(t)
+                for g, l in zip(t.shape if len(t.shape) >= 2 else (1, t.shape[0]), local if len(local) >= 2 else (1, local[0])):
+                    if g != l:
+                        dim_map[g] = l

         # Per-SIP kernel launch: each SIP gets TensorArgs with local va_base
         last_handle = None
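The two changes in this hunk can be illustrated with plain tuples. This is a simplified sketch, not the runtime's code: it handles only 2-D shapes and only the cube axis, `local_shape` and `build_dim_map` are hypothetical stand-ins for the nested helper and the loop above, and the final `dim_map.get(...)` lookup is an assumption about how launch applies the map to scalar arguments (that part is not shown in the diff).

```python
def local_shape(shape, cube_policy, policy_num_cubes, topo_num_cubes):
    """Cube-axis sharding only; the policy's cube count wins when it is set."""
    M, K = shape
    if cube_policy != "replicate":
        eff = policy_num_cubes if policy_num_cubes is not None else topo_num_cubes
        if cube_policy == "column_wise":
            K //= eff
        elif cube_policy == "row_wise":
            M //= eff
    return (M, K)


def build_dim_map(global_shapes, local_shapes):
    """Map every global dim that shrank to its local size."""
    dim_map = {}
    for g_shape, l_shape in zip(global_shapes, local_shapes):
        for g, l in zip(g_shape, l_shape):
            if g != l:
                dim_map[g] = l
    return dim_map


# Multi-user validation scale: K is (S_kv_per_rank * 4, d_head) = (64, 64),
# sharded row-wise over 4 cubes on a topology that has 16 cubes.
k_global = (64, 64)
k_local = local_shape(k_global, "row_wise", 4, 16)          # policy count used
k_local_topo = local_shape(k_global, "row_wise", None, 16)  # topology fallback

dim_map = build_dim_map([k_global], [k_local])

# A scalar kernel argument that happens to equal a remapped global dim:
d_head = 64
remapped_d_head = dim_map.get(d_head, d_head)

print(k_local, k_local_topo, dim_map, remapped_d_head)
```

The last value shows the collision the comment describes: because the global row count and `d_head` are both 64, a value-keyed remap would rewrite `d_head` to 16, which is why the mesh kernels pass `_auto_dim_remap=False`.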
@@ -25,6 +25,37 @@ class DataExecutor:
     def __init__(self, op_log: list[OpRecord], store: MemoryStore) -> None:
         self._op_log = op_log
         self.store = store
+        # Per-slot time-ordered shape-keyed history. Populated on every
+        # ipcq_copy WRITE; consulted on reads that find a shape-mismatched
+        # value in MemoryStore (the slot was wrapped by a later inbound
+        # before this read's Phase 2 turn). Required because Phase 1 cannot
+        # snapshot math-output sources at outbound time (math executes only
+        # in Phase 2), so token.data is None and slot wraps lose the recv-
+        # time value. See test_attention_mesh_decode_diag (ADR-0059 mesh).
+        self._slot_history: dict[tuple[str, int], list[tuple[float, Any]]] = {}
+
+    def _resolve_read(
+        self, space: str, addr: int,
+        shape: tuple[int, ...] | None, dtype: str | None,
+        t_at_or_before: float,
+    ) -> Any:
+        """Read (space, addr) with expected shape. On KeyError or shape
+        mismatch in MemoryStore, fall back to ``_slot_history`` for the
+        most recent shape-matching entry with t_write <= t_at_or_before.
+        Returns None when no match is found."""
+        try:
+            return self.store.read(space, addr, shape=shape, dtype=dtype)
+        except (KeyError, ValueError):
+            pass
+        hist = self._slot_history.get((space, addr))
+        if hist is None:
+            return None
+        for t_w, val in reversed(hist):
+            if t_w > t_at_or_before:
+                continue
+            if shape is None or getattr(val, "shape", None) == shape:
+                return val
+        return None

     # Ordering priority within the same t_start: memory copies must run
     # before math/gemm so that slot data is populated before a consumer
@@ -87,14 +118,23 @@ class DataExecutor:
         # only get populated by Phase 2's math replay).
         data = p.get("snapshot")
         if data is None:
-            try:
-                data = self.store.read(
-                    src_space, src_addr,
-                    shape=p.get("shape"), dtype=p.get("dtype"),
-                )
-            except KeyError:
+            data = self._resolve_read(
+                src_space, src_addr,
+                p.get("shape"), p.get("dtype"), op.t_start,
+            )
+            if data is None:
                 return
         self.store.write(dst_space, dst_addr, data)
+        # Record this write in slot history so a later forwarded read
+        # at src=dst_addr (a different ipcq_copy whose src is this slot)
+        # can recover by shape even after the slot has been wrapped.
+        if op.op_name == "ipcq_copy":
+            self._slot_history.setdefault(
+                (dst_space, dst_addr), [],
+            ).append((
+                op.t_start,
+                data.copy() if hasattr(data, "copy") else data,
+            ))

     def _execute_gemm(self, op: OpRecord) -> None:
         """Execute GEMM: out = a @ b."""
@@ -110,10 +150,16 @@ class DataExecutor:
         dtype_in = p.get("dtype_in", "f16")
         dtype_out = p.get("dtype_out", dtype_in)

-        a = self.store.read(src_a_space, p["src_a_addr"],
-                            shape=p.get("shape_a"), dtype=dtype_in)
-        b = self.store.read(src_b_space, p["src_b_addr"],
-                            shape=p.get("shape_b"), dtype=dtype_in)
+        a = self._resolve_read(src_a_space, p["src_a_addr"],
+                               p.get("shape_a"), dtype_in, op.t_start)
+        if a is None:
+            a = self.store.read(src_a_space, p["src_a_addr"],
+                                shape=p.get("shape_a"), dtype=dtype_in)
+        b = self._resolve_read(src_b_space, p["src_b_addr"],
+                               p.get("shape_b"), dtype_in, op.t_start)
+        if b is None:
+            b = self.store.read(src_b_space, p["src_b_addr"],
+                                shape=p.get("shape_b"), dtype=dtype_in)

         # Compute in higher precision if specified
         dtype_acc = p.get("dtype_acc", "f32")
@@ -150,8 +196,11 @@ class DataExecutor:
         ):
             if snap is not None:
                 inputs.append(snap)
-            else:
-                inputs.append(self.store.read(space, addr, shape=shape, dtype=idtype))
+                continue
+            resolved = self._resolve_read(space, addr, shape, idtype, op.t_start)
+            if resolved is None:
+                resolved = self.store.read(space, addr, shape=shape, dtype=idtype)
+            inputs.append(resolved)

         result = _compute_math(math_op, inputs, p.get("axis"))
         if result is not None:
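The slot-history fallback added to `DataExecutor` can be sketched with the standard library alone. Everything here is a hypothetical stand-in: `Block`, `TinyStore`, and `Replayer` are invented for the illustration, and `TinyStore.read` is assumed to raise `KeyError` for an empty slot and `ValueError` on a shape mismatch, which is inferred from the `except (KeyError, ValueError)` clause rather than from the real `MemoryStore`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    shape: tuple
    tag: str


class TinyStore:
    """Hypothetical store: one value per (space, addr), shape-checked reads."""

    def __init__(self):
        self._mem = {}

    def write(self, space, addr, val):
        self._mem[(space, addr)] = val

    def read(self, space, addr, shape=None):
        val = self._mem[(space, addr)]          # KeyError when never written
        if shape is not None and val.shape != shape:
            raise ValueError("shape mismatch")
        return val


class Replayer:
    def __init__(self):
        self.store = TinyStore()
        self.history = {}                       # (space, addr) -> [(t, value)]

    def ipcq_write(self, t, space, addr, val):
        self.store.write(space, addr, val)
        self.history.setdefault((space, addr), []).append((t, val))

    def resolve_read(self, space, addr, shape, t_at_or_before):
        try:
            return self.store.read(space, addr, shape=shape)
        except (KeyError, ValueError):
            pass
        # Newest first: take the latest shape-matching write not after t.
        for t_w, val in reversed(self.history.get((space, addr), [])):
            if t_w <= t_at_or_before and (shape is None or val.shape == shape):
                return val
        return None


r = Replayer()
r.ipcq_write(1.0, "tcm", 0x100, Block((1, 64), "partial"))
r.ipcq_write(2.0, "tcm", 0x100, Block((16, 64), "kv_tile"))   # slot wrapped

late = r.resolve_read("tcm", 0x100, (1, 64), t_at_or_before=3.0)
live = r.resolve_read("tcm", 0x100, (16, 64), t_at_or_before=3.0)
too_early = r.resolve_read("tcm", 0x100, (1, 64), t_at_or_before=0.5)
print(late.tag, live.tag, too_early)
```

The reader that still expects the `(1, 64)` value recovers it from history even though the slot now holds a `(16, 64)` tile. Matching by shape is a heuristic: two different values of the same shape written to one slot would resolve to the most recent one at or before the read time.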
@@ -96,13 +96,20 @@ class OpLogger:
         # gets reused on the next ring round).
         if self._memory_store is not None:
             if op_kind == "math":
+                handle_snaps = params.get("input_handle_data") or ()
                 snaps: list[Any] = []
-                for addr, shape, space, idtype in zip(
+                for i, (addr, shape, space, idtype) in enumerate(zip(
                     params.get("input_addrs", []),
                     params.get("input_shapes", []),
                     params.get("input_spaces", []),
                     params.get("input_dtypes", []),
-                ):
+                )):
+                    if i < len(handle_snaps) and handle_snaps[i] is not None:
+                        carried = handle_snaps[i]
+                        snaps.append(
+                            carried.copy() if hasattr(carried, "copy") else carried
+                        )
+                        continue
                     try:
                         arr = self._memory_store.read(
                             space, addr, shape=shape, dtype=idtype,
@@ -111,6 +118,7 @@ class OpLogger:
                     except Exception:
                         snaps.append(None)
                 params["input_snapshots"] = snaps
+                params.pop("input_handle_data", None)
             elif op_name == "dma_write":
                 # ADR-0027 fix: only snapshot HBM sources. TCM (PE scratch)
                 # sources are repopulated by Phase 2 math/gemm replay —
@@ -222,6 +230,7 @@ def _extract_op_info(msg: Any) -> tuple[str, str, dict[str, Any]]:
             "input_shapes": [h.shape for h in msg.inputs],
             "input_spaces": [getattr(h, "space", "tcm") for h in msg.inputs],
             "input_dtypes": [h.dtype for h in msg.inputs],
+            "input_handle_data": tuple(getattr(h, "data", None) for h in msg.inputs),
             "dst_addr": msg.out.addr,
             "dst_space": getattr(msg.out, "space", "tcm"),
             "shape_out": msg.out.shape,
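The OpLogger change prefers data carried on the input handle over a read from the memory store when snapshotting math inputs. A minimal sketch of that precedence follows; `snapshot_inputs` and the dict-backed `read` are hypothetical simplifications (the real code also passes shape, space, and dtype, and how it stores the array it reads is not shown in the diff).

```python
def snapshot_inputs(addrs, handle_data, read):
    """Per input: carried handle data first, then the store, else None."""
    snaps = []
    for i, addr in enumerate(addrs):
        carried = handle_data[i] if i < len(handle_data) else None
        if carried is not None:
            # Copy so later in-place changes to the handle do not leak in.
            snaps.append(carried.copy() if hasattr(carried, "copy") else carried)
            continue
        try:
            snaps.append(read(addr))
        except Exception:
            snaps.append(None)      # unreadable slot: leave it to replay
    return snaps


store = {0x10: [1, 2]}
handle_data = ([9, 9], None, None)

snaps = snapshot_inputs([0x10, 0x10, 0x20], handle_data, store.__getitem__)
print(snaps)
```

The first input uses the carried value even though the store holds something else at that address, the second falls back to the store, and the third has no source at snapshot time.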
@@ -0,0 +1,198 @@
"""End-to-end engine drives for the four GQA Llama-70B panels (sub-cycle 4c step 2).

Mirrors the existing single_user_decode diag harness across all four panels
of the milestone-gqa-llama70b sweep (ADR-0057):

    single_user_prefill   ring-K/V kernel, intracube PE ring (8 PEs / 1 cube)
    single_user_decode    allreduce-mlo kernel, intracube PE ring
    multi_user_prefill    ring-K/V kernel, intercube multisip (4 cubes)
    multi_user_decode     allreduce-mlo kernel, intercube multisip

Each test runs the panel through ``run_bench`` with ``enable_data=True``
and asserts ``result.completion.ok``. Failures dump the engine's op_log
tail and the exception, mirroring the decode-diag harness format.

Validation-scale config matches ADR-0057 D4:
    S_q_prefill=16, S_kv_per_rank=16, h_q=h_kv=1, d_head=64
    n_ranks_single_user=8, n_ranks_multi_user=4
"""
from __future__ import annotations

import traceback
from pathlib import Path

import pytest

from kernbench.benches._attention_mesh_kv import attention_mesh_kv_kernel
from kernbench.benches._attention_mesh_mlo import attention_mesh_mlo_kernel
from kernbench.ccl.install import load_ccl_config, resolve_algorithm_config
from kernbench.ccl.sfr_config import (
    configure_sfr_intercube_multisip,
    configure_sfr_intracube_pe_ring,
)
from kernbench.policy.placement.dp import DPPolicy
from kernbench.runtime_api.bench_runner import run_bench
from kernbench.runtime_api.types import resolve_device
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import resolve_topology

TOPOLOGY_PATH = Path(__file__).resolve().parents[2] / "topology.yaml"

S_Q_PREFILL = 16
S_Q_DECODE = 1
S_KV_PER_RANK = 16
H_Q = 1
H_KV = 1
D_HEAD = 64
N_RANKS_SINGLE_USER = 8
N_RANKS_MULTI_USER = 4
DTYPE = "f16"


# ── Helpers ──────────────────────────────────────────────────────


def _engine_factory(t, d):
    return GraphEngine(getattr(t, "topology_obj", t), enable_data=True)


def _run_panel(bench_fn):
    """Drive a panel through run_bench; return (exc, result, engine)."""
    topo = resolve_topology(str(TOPOLOGY_PATH))
    captured: dict = {"engine": None}

    def factory(t, d):
        eng = _engine_factory(t, d)
        captured["engine"] = eng
        return eng

    exc = None
    result = None
    try:
        result = run_bench(
            topology=topo, bench_fn=bench_fn,
            device=resolve_device(None), engine_factory=factory,
        )
    except BaseException as e:  # noqa: BLE001
        exc = e
    return exc, result, captured["engine"]


def _assert_ok(name: str, exc, result, engine) -> None:
    if exc is not None:
        oplog_len = len(getattr(engine, "op_log", []) or []) if engine else 0
        print(f"\n========== {name} FAIL ==========")
        print(f"op_log records before crash: {oplog_len}")
        print(f"{type(exc).__name__}: {exc}")
        traceback.print_exception(type(exc), exc, exc.__traceback__)
        raise AssertionError(
            f"{name} failed at runtime: {exc}"
        ) from exc
    assert result is not None, f"{name}: no result"
    assert result.completion.ok, f"{name}: completion not ok — {result.completion}"


# ── Panel bench fns ──────────────────────────────────────────────


def _bench_fn_single_user_prefill(ctx):
    configure_sfr_intracube_pe_ring(
        ctx.engine, ctx.spec,
        resolve_algorithm_config(load_ccl_config(), name="lrab_hierarchical_allreduce"),
    )
    n = N_RANKS_SINGLE_USER
    dp_full = DPPolicy(cube="replicate", pe="replicate", num_cubes=1, num_pes=n)
    dp_kv = DPPolicy(cube="replicate", pe="row_wise", num_cubes=1, num_pes=n)
    q = ctx.zeros((S_Q_PREFILL, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="q")
    k = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="k")
    v = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="v")
    o = ctx.empty((S_Q_PREFILL, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="o")
    ctx.launch(
        "single_user_prefill_mesh", attention_mesh_kv_kernel,
        q, k, v, o,
        S_Q_PREFILL, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, n,
    )


def _bench_fn_single_user_decode(ctx):
    configure_sfr_intracube_pe_ring(
        ctx.engine, ctx.spec,
        resolve_algorithm_config(load_ccl_config(), name="lrab_hierarchical_allreduce"),
    )
    n = N_RANKS_SINGLE_USER
    dp_full = DPPolicy(cube="replicate", pe="replicate", num_cubes=1, num_pes=n)
    dp_kv = DPPolicy(cube="replicate", pe="row_wise", num_cubes=1, num_pes=n)
    q = ctx.zeros((S_Q_DECODE, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="q")
    k = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="k")
    v = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="v")
    o = ctx.empty((S_Q_DECODE, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="o")
    ctx.launch(
        "single_user_decode_mesh", attention_mesh_mlo_kernel,
        q, k, v, o,
        S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, n,
    )


def _bench_fn_multi_user_prefill(ctx):
    configure_sfr_intercube_multisip(
        ctx.engine, ctx.spec,
        resolve_algorithm_config(load_ccl_config(), name="lrab_hierarchical_allreduce"),
    )
    n = N_RANKS_MULTI_USER
    dp_full = DPPolicy(cube="replicate", pe="replicate", num_cubes=n, num_pes=8)
    dp_kv = DPPolicy(cube="row_wise", pe="replicate", num_cubes=n, num_pes=8)
    q = ctx.zeros((S_Q_PREFILL, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="q")
    k = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="k")
    v = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="v")
    o = ctx.empty((S_Q_PREFILL, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="o")
    ctx.launch(
        "multi_user_prefill_mesh", attention_mesh_kv_kernel,
        q, k, v, o,
        S_Q_PREFILL, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, n,
        1,  # rank_axis=1 → ring at cube level (ADR-0059 multi_user)
        _auto_dim_remap=False,
    )


def _bench_fn_multi_user_decode(ctx):
    configure_sfr_intercube_multisip(
        ctx.engine, ctx.spec,
        resolve_algorithm_config(load_ccl_config(), name="lrab_hierarchical_allreduce"),
    )
    n = N_RANKS_MULTI_USER
    dp_full = DPPolicy(cube="replicate", pe="replicate", num_cubes=n, num_pes=8)
    dp_kv = DPPolicy(cube="row_wise", pe="replicate", num_cubes=n, num_pes=8)
    q = ctx.zeros((S_Q_DECODE, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="q")
    k = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="k")
    v = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="v")
    o = ctx.empty((S_Q_DECODE, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="o")
    ctx.launch(
        "multi_user_decode_mesh", attention_mesh_mlo_kernel,
        q, k, v, o,
        S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, n,
        1,  # rank_axis=1 → ring at cube level (ADR-0059 multi_user)
        _auto_dim_remap=False,
    )


# ── Tests ────────────────────────────────────────────────────────


def test_single_user_prefill_through_engine():
    exc, result, engine = _run_panel(_bench_fn_single_user_prefill)
    _assert_ok("single_user_prefill", exc, result, engine)


def test_single_user_decode_through_engine():
    exc, result, engine = _run_panel(_bench_fn_single_user_decode)
    _assert_ok("single_user_decode", exc, result, engine)


def test_multi_user_prefill_through_engine():
    exc, result, engine = _run_panel(_bench_fn_multi_user_prefill)
    _assert_ok("multi_user_prefill", exc, result, engine)


def test_multi_user_decode_through_engine():
    exc, result, engine = _run_panel(_bench_fn_multi_user_decode)
    _assert_ok("multi_user_decode", exc, result, engine)
@@ -0,0 +1,172 @@
"""Phase 1 spec test for ``rank_axis`` parameter on the two mesh kernels.

ADR-0059's mesh kernels currently hard-code ``rank = tl.program_id(axis=0)``,
which only works for single_user_* panels (rank == pe_id within cube).
For multi_user_* panels the ring is at the cube level — rank should be
``cube_id`` (axis=1), and the 7 non-rank-leader PEs in each cube should
not run the ring (they only hold KV replicas).

This test pins the desired ``rank_axis`` kwarg semantics:

    rank_axis = 0 (default, single_user)
        rank = tl.program_id(axis=0). Every PE in the cube runs the ring.
        Existing behavior — no change.

    rank_axis = 1 (multi_user)
        if tl.program_id(axis=0) != 0: return. (7/8 PEs early-exit.)
        rank = tl.program_id(axis=1).

Phase 1 expectation: tests fail today (kernels don't accept the kwarg).
Phase 2 lands the parameter on both kernels; tests turn green and the
multi_user_* diag harness clears its first send.
"""
from __future__ import annotations

from kernbench.common.ipcq_types import IpcqRecvCmd, IpcqSendCmd
from kernbench.common.pe_commands import GemmCmd
from kernbench.triton_emu.tl_context import TLContext, run_kernel

from kernbench.benches._attention_mesh_kv import attention_mesh_kv_kernel
from kernbench.benches._attention_mesh_mlo import attention_mesh_mlo_kernel

S_Q_PREFILL = 16
S_Q_DECODE = 1
S_KV_PER_RANK = 16
H_Q = 1
H_KV = 1
D_HEAD = 64
N_RANKS_MULTI = 4
PES_PER_CUBE = 8

Q_PTR = 0x10000
K_PTR = 0x20000
V_PTR = 0x30000
O_PTR = 0x40000


def _tl(pe_id: int, cube_id: int, num_pes: int, num_cubes: int) -> TLContext:
    return TLContext(
        pe_id=pe_id,
        num_programs=num_pes,
        cube_id=cube_id,
        num_cubes=num_cubes,
        dispatch_cycles=0,
        scratch_base=0x80000,
        scratch_size=1 << 20,
    )


# ── Default rank_axis=0 backward-compat ──────────────────────────


def test_mlo_kernel_default_rank_axis_zero_emits_commands_on_all_pes():
    """rank_axis defaults to 0 → kernel uses pe_id as rank, runs on every
    PE. Verify by running rank=3 (interior PE) in a single-cube 8-rank
    setup and asserting at least one GEMM and at least one IPCQ send
    are emitted (interior ranks send in both directions)."""
    tl = _tl(pe_id=3, cube_id=0, num_pes=8, num_cubes=1)
    run_kernel(
        attention_mesh_mlo_kernel, tl,
        Q_PTR, K_PTR, V_PTR, O_PTR,
        S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, 8,
    )
    assert any(isinstance(c, GemmCmd) for c in tl.commands), \
        "default rank_axis=0 must run the kernel (≥1 GEMM)"
    assert any(isinstance(c, IpcqSendCmd) for c in tl.commands), \
        "interior rank must emit ≥1 IpcqSendCmd"


def test_kv_kernel_default_rank_axis_zero_emits_commands_on_all_pes():
    tl = _tl(pe_id=3, cube_id=0, num_pes=8, num_cubes=1)
    run_kernel(
        attention_mesh_kv_kernel, tl,
        Q_PTR, K_PTR, V_PTR, O_PTR,
        S_Q_PREFILL, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, 8,
    )
    assert any(isinstance(c, GemmCmd) for c in tl.commands)
    assert any(isinstance(c, IpcqSendCmd) for c in tl.commands)


# ── rank_axis=1 multi_user semantics ─────────────────────────────
|
||||
|
||||
|
||||
def test_mlo_kernel_rank_axis_one_gates_non_zero_pe_to_no_commands():
|
||||
"""rank_axis=1 + pe_id != 0 → kernel must early-return; no GEMM,
|
||||
no DMA, no IPCQ. The 7 non-rank-leader PEs in a multi_user cube
|
||||
must stay completely silent so the cube-level SFR install isn't
|
||||
asked to route sends from PEs that have no neighbors installed."""
|
||||
tl = _tl(pe_id=2, cube_id=1, num_pes=PES_PER_CUBE, num_cubes=N_RANKS_MULTI)
|
||||
run_kernel(
|
||||
attention_mesh_mlo_kernel, tl,
|
||||
Q_PTR, K_PTR, V_PTR, O_PTR,
|
||||
S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, N_RANKS_MULTI,
|
||||
rank_axis=1,
|
||||
)
|
||||
assert not any(isinstance(c, GemmCmd) for c in tl.commands), \
|
||||
"pe_id=2 with rank_axis=1 must not emit GEMMs"
|
||||
assert not any(isinstance(c, IpcqSendCmd) for c in tl.commands), \
|
||||
"pe_id=2 with rank_axis=1 must not emit IpcqSendCmd"
|
||||
assert not any(isinstance(c, IpcqRecvCmd) for c in tl.commands), \
|
||||
"pe_id=2 with rank_axis=1 must not emit IpcqRecvCmd"
|
||||
|
||||
|
||||
def test_kv_kernel_rank_axis_one_gates_non_zero_pe_to_no_commands():
|
||||
tl = _tl(pe_id=2, cube_id=1, num_pes=PES_PER_CUBE, num_cubes=N_RANKS_MULTI)
|
||||
run_kernel(
|
||||
attention_mesh_kv_kernel, tl,
|
||||
Q_PTR, K_PTR, V_PTR, O_PTR,
|
||||
S_Q_PREFILL, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, N_RANKS_MULTI,
|
||||
rank_axis=1,
|
||||
)
|
||||
assert not any(isinstance(c, GemmCmd) for c in tl.commands)
|
||||
assert not any(isinstance(c, IpcqSendCmd) for c in tl.commands)
|
||||
assert not any(isinstance(c, IpcqRecvCmd) for c in tl.commands)
|
||||
|
||||
|
||||
def test_mlo_kernel_rank_axis_one_pe_zero_uses_cube_id_as_rank():
|
||||
"""rank_axis=1 + pe_id == 0 → kernel runs the ring with rank=cube_id.
|
||||
For cube_id=1 in a 4-cube ring, rank=1 is an interior rank: has_E=True
|
||||
AND has_W=True → IPCQ sends emitted in both E and W directions.
|
||||
"""
|
||||
tl = _tl(pe_id=0, cube_id=1, num_pes=PES_PER_CUBE, num_cubes=N_RANKS_MULTI)
|
||||
run_kernel(
|
||||
attention_mesh_mlo_kernel, tl,
|
||||
Q_PTR, K_PTR, V_PTR, O_PTR,
|
||||
S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, N_RANKS_MULTI,
|
||||
rank_axis=1,
|
||||
)
|
||||
sends = [c for c in tl.commands if isinstance(c, IpcqSendCmd)]
|
||||
assert any(s.direction == "E" for s in sends), \
|
||||
"cube_id=1 (interior) must emit ≥1 E-send"
|
||||
assert any(s.direction == "W" for s in sends), \
|
||||
"cube_id=1 (interior) must emit ≥1 W-send"
|
||||
|
||||
|
||||
def test_kv_kernel_rank_axis_one_pe_zero_uses_cube_id_as_rank():
|
||||
tl = _tl(pe_id=0, cube_id=1, num_pes=PES_PER_CUBE, num_cubes=N_RANKS_MULTI)
|
||||
run_kernel(
|
||||
attention_mesh_kv_kernel, tl,
|
||||
Q_PTR, K_PTR, V_PTR, O_PTR,
|
||||
S_Q_PREFILL, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, N_RANKS_MULTI,
|
||||
rank_axis=1,
|
||||
)
|
||||
sends = [c for c in tl.commands if isinstance(c, IpcqSendCmd)]
|
||||
assert any(s.direction == "E" for s in sends)
|
||||
assert any(s.direction == "W" for s in sends)
|
||||
|
||||
|
||||
def test_mlo_kernel_rank_axis_one_west_edge_cube_no_west_sends():
|
||||
"""cube_id=0 (west edge) with rank_axis=1: rank=0, has_W=False → no
|
||||
W-direction IPCQ sends. has_E=True → ≥1 E-direction send."""
|
||||
tl = _tl(pe_id=0, cube_id=0, num_pes=PES_PER_CUBE, num_cubes=N_RANKS_MULTI)
|
||||
run_kernel(
|
||||
attention_mesh_mlo_kernel, tl,
|
||||
Q_PTR, K_PTR, V_PTR, O_PTR,
|
||||
S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, N_RANKS_MULTI,
|
||||
rank_axis=1,
|
||||
)
|
||||
sends = [c for c in tl.commands if isinstance(c, IpcqSendCmd)]
|
||||
assert any(s.direction == "E" for s in sends), \
|
||||
"west-edge cube_id=0 must still emit ≥1 E-send"
|
||||
assert not any(s.direction == "W" for s in sends), \
|
||||
"west-edge cube_id=0 must NOT emit any W-send (no W neighbor)"
|
||||
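The gating rule these `rank_axis=1` tests pin down can be restated as a small stand-alone sketch. It is not the kernel code: `RingRole` and `ring_role` are hypothetical names, the east-edge rule (`rank < n_ranks - 1`) is inferred by symmetry from the west-edge test, and the `rank_axis != 1` branch is an assumption that the tests above do not exercise.

```python
from typing import NamedTuple, Optional


class RingRole(NamedTuple):
    rank: int
    has_W: bool  # a west neighbor exists
    has_E: bool  # an east neighbor exists


def ring_role(pe_id: int, cube_id: int, n_ranks: int,
              rank_axis: int) -> Optional[RingRole]:
    """Return the role of one PE, or None if the PE is gated off."""
    if rank_axis == 1:
        if pe_id != 0:
            return None  # non-zero PEs emit no commands at all
        rank = cube_id  # the cube index doubles as the rank
    else:
        rank = pe_id  # assumption: not covered by the tests above
    return RingRole(rank=rank, has_W=rank > 0, has_E=rank < n_ranks - 1)


# The three situations the tests above check, for a 4-rank setup.
roles = {
    "gated": ring_role(pe_id=2, cube_id=1, n_ranks=4, rank_axis=1),
    "interior": ring_role(pe_id=0, cube_id=1, n_ranks=4, rank_axis=1),
    "west_edge": ring_role(pe_id=0, cube_id=0, n_ranks=4, rank_axis=1),
}
```

Under these assumptions, an interior rank sends in both directions, the west edge sends only east, and a gated PE has no role.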
@@ -1,283 +1,31 @@
|
||||
"""Shared plotting plumbing for the GEMM figure tests.
|
||||
"""Thin re-export shim for the GEMM figure tests.
|
||||
|
||||
Not a test module (no ``test_`` prefix -> pytest does not collect it).
|
||||
Not a test module (no ``test_`` prefix → pytest does not collect it).
|
||||
|
||||
Reads the committed ``docs/diagrams/gemm_sweep.json`` (produced by the heavy
|
||||
``scripts/gemm_sweep.py`` sim sweep) and renders matplotlib PNGs into
|
||||
``docs/diagrams/gemm_plots/``. No simulation here -> the figure tests are fast
|
||||
and run by default; regenerating the underlying data stays a manual script.
|
||||
|
||||
Chart set (mirrors the GEMM MAC slides in scripts/build_overview_slides.py):
|
||||
- stage breakdown (load_ref operand staging)
|
||||
- MAC utilization — measured (load_ref)
|
||||
- MAC utilization — theoretical vs measured (load_ref)
|
||||
The sweep + renderer logic now lives in
|
||||
``kernbench.benches.milestone_1h_gemm`` (production single home, ADR-0054,
|
||||
also driven by ``scripts/gemm_sweep.py``). The figure tests import the same
|
||||
names from here; behavior is unchanged (defaults still target
|
||||
``docs/diagrams/gemm_plots/``).
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from pathlib import Path
|
||||
from kernbench.benches.milestone_1h_gemm import (
|
||||
DEFAULT_PLOTS_DIR as GEMM_PLOTS_DIR,
|
||||
DEFAULT_SWEEP_JSON as GEMM_SWEEP_JSON,
|
||||
ROOT,
|
||||
emit_all_gemm_plots,
|
||||
emit_mac_utilization_measured,
|
||||
emit_mac_utilization_theoretical_vs_measured,
|
||||
emit_stage_breakdown,
|
||||
)
|
||||
|
||||
ROOT = Path(__file__).resolve().parent.parent.parent
|
||||
GEMM_SWEEP_JSON = ROOT / "docs" / "diagrams" / "gemm_sweep.json"
|
||||
GEMM_PLOTS_DIR = ROOT / "docs" / "diagrams" / "gemm_plots"
|
||||
|
||||
# Shapes excluded from the figures (mirrors build_overview_slides).
|
||||
EXCLUDED_SHAPES = {(512, 512, 512)}
|
||||
|
||||
# Stage bars shown (raw op_log stage_type keys) + display names + colors.
|
||||
STAGE_KEYS = ["DMA_READ", "FETCH", "GEMM", "DMA_WRITE"]
|
||||
STAGE_DISPLAY = {
|
||||
"DMA_READ": "DMA in",
|
||||
"FETCH": "Fetch",
|
||||
"GEMM": "GEMM",
|
||||
"DMA_WRITE": "DMA out",
|
||||
}
|
||||
STAGE_COLORS = {
|
||||
"DMA_READ": "#3B82F6",
|
||||
"FETCH": "#10B981",
|
||||
"GEMM": "#F59E0B",
|
||||
"DMA_WRITE": "#A855F7",
|
||||
}
|
||||
|
||||
# MAC-utilization model constants (mirror build_overview_slides).
|
||||
_HBM_GBS = 256.0
|
||||
_BPE = 2
|
||||
_T_STAGE = 16.0
|
||||
_D_STAGES = 3
|
||||
|
||||
_PLOT_VARIANT = "load_ref"
|
||||
|
||||
|
||||
def _load_sweep_data() -> dict:
|
||||
if not GEMM_SWEEP_JSON.exists():
|
||||
return {"rows": []}
|
||||
data = json.loads(GEMM_SWEEP_JSON.read_text())
|
||||
data["rows"] = [
|
||||
r for r in data.get("rows", [])
|
||||
if (r["M"], r["K"], r["N"]) not in EXCLUDED_SHAPES
|
||||
]
|
||||
return data
|
||||
|
||||
|
||||
def _shape_label(r: dict) -> str:
|
||||
if r["M"] == r["K"] == r["N"]:
|
||||
return f"M=K=N={r['M']}"
|
||||
return f"M={r['M']} K={r['K']} N={r['N']}"
|
||||
|
||||
|
||||
def _under_tile(M, K, N, tile_M, tile_K, tile_N) -> bool:
|
||||
return M < tile_M or K < tile_K or N < tile_N
|
||||
|
||||
|
||||
def _xtick_labels(shape_labels, tile_counts, flagged) -> list[str]:
|
||||
out = []
|
||||
for lbl, tc, fl in zip(shape_labels, tile_counts, flagged):
|
||||
s = f"{lbl}\n({tc} tiles)"
|
||||
if fl:
|
||||
s += " *"
|
||||
out.append(s)
|
||||
return out
|
||||
|
||||
|
||||
def _grouped_bar_png(
|
||||
out_name: str, *, title: str, subtitle: str | None,
|
||||
shape_labels, tile_counts, flagged, series: dict, colors: dict,
|
||||
y_label: str, threshold: float | None = None, footnote: str | None = None,
|
||||
) -> str:
|
||||
"""Render one grouped-bar chart to GEMM_PLOTS_DIR/out_name; return the path."""
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
|
||||
n_groups = len(shape_labels)
|
||||
n_series = max(1, len(series))
|
||||
x = np.arange(n_groups)
|
||||
width = 0.8 / n_series
|
||||
|
||||
fig, ax = plt.subplots(figsize=(11, 6))
|
||||
for i, (name, vals) in enumerate(series.items()):
|
||||
offset = (i - (n_series - 1) / 2) * width
|
||||
ax.bar(x + offset, vals, width, label=name, color=colors.get(name))
|
||||
|
||||
ax.set_xticks(x)
|
||||
ax.set_xticklabels(
|
||||
_xtick_labels(shape_labels, tile_counts, flagged), fontsize=8,
|
||||
)
|
||||
ax.set_ylabel(y_label)
|
||||
ax.set_title(title, fontsize=13, fontweight="bold")
|
||||
if subtitle:
|
||||
ax.text(0.5, 1.01, subtitle, transform=ax.transAxes, ha="center",
|
||||
va="bottom", fontsize=8, color="#475569")
|
||||
if threshold is not None:
|
||||
ax.axhline(threshold, ls="--", color="gray", lw=1.0)
|
||||
ax.legend(fontsize=8, loc="upper right")
|
||||
ax.grid(True, axis="y", alpha=0.3)
|
||||
|
||||
caption = "* = under-tile shape (M<TILE_M, K<TILE_K, or N<TILE_N)"
|
||||
if footnote:
|
||||
caption = footnote + "\n" + caption
|
||||
fig.text(0.5, 0.01, caption, ha="center", fontsize=7, color="gray",
|
||||
wrap=True)
|
||||
|
||||
fig.tight_layout(rect=(0, 0.05, 1, 1))
|
||||
GEMM_PLOTS_DIR.mkdir(parents=True, exist_ok=True)
|
||||
out = GEMM_PLOTS_DIR / out_name
|
||||
fig.savefig(out, dpi=120)
|
||||
plt.close(fig)
|
||||
return str(out)
|
||||
|
||||
|
||||
# ── individual chart renderers (read sweep JSON, emit one PNG each) ─────
|
||||
|
||||
|
||||
def emit_stage_breakdown() -> str | None:
|
||||
"""Per-stage engine wall-clock per shape (load_ref operand staging)."""
|
||||
data = _load_sweep_data()
|
||||
rows = [r for r in data["rows"] if r.get("variant") == _PLOT_VARIANT]
|
||||
if not rows:
|
||||
return None
|
||||
tile = data["tile_sizes"]
|
||||
shape_labels = [_shape_label(r) for r in rows]
|
||||
flagged = [_under_tile(r["M"], r["K"], r["N"], tile["M"], tile["K"], tile["N"])
|
||||
for r in rows]
|
||||
tile_counts = [r["tile_count_expected"] for r in rows]
|
||||
series = {
|
||||
STAGE_DISPLAY[s]: [r.get("stages", {}).get(s, {}).get("wall_ns", 0.0)
|
||||
for r in rows]
|
||||
for s in STAGE_KEYS
|
||||
}
|
||||
colors = {STAGE_DISPLAY[s]: STAGE_COLORS[s] for s in STAGE_KEYS}
|
||||
return _grouped_bar_png(
|
||||
"gemm_stage_breakdown.png",
|
||||
title="GEMM stage breakdown",
|
||||
subtitle=(f"Per-stage engine wall-clock (DMA in / Fetch / GEMM / "
|
||||
f"DMA out), {_PLOT_VARIANT} staging. "
|
||||
f"Tile {tile['M']}x{tile['K']}x{tile['N']}."),
|
||||
shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
|
||||
series=series, colors=colors, y_label="ns",
|
||||
footnote="Bars = engine wall-clock interval (merged overlaps).",
|
||||
)
|
||||
|
||||
|
||||
def emit_mac_utilization_measured() -> str | None:
|
||||
"""GEMM util % and useful pipeline-eff % (analytical model, load_ref)."""
|
||||
data = _load_sweep_data()
|
||||
rows = data["rows"]
|
||||
if not rows:
|
||||
return None
|
||||
tile = data["tile_sizes"]
|
||||
TILE_M, TILE_K, TILE_N = tile["M"], tile["K"], tile["N"]
|
||||
tile_flops = 2 * TILE_M * TILE_K * TILE_N
|
||||
dma_w_per_pair = (TILE_M * TILE_N * _BPE) / _HBM_GBS
|
||||
head_ns = (_D_STAGES - 1) * _T_STAGE
|
||||
|
||||
by_shape = {(r["M"], r["K"], r["N"]): r
|
||||
for r in rows if r["variant"] == _PLOT_VARIANT}
|
||||
shapes = list(by_shape)
|
||||
if not shapes:
|
||||
return None
|
||||
shape_labels = [_shape_label(by_shape[k]) for k in shapes]
|
||||
flagged = [_under_tile(*k, TILE_M, TILE_K, TILE_N) for k in shapes]
|
||||
tile_counts = [by_shape[k]["tile_count_expected"] for k in shapes]
|
||||
|
||||
gemm_util, useful_eff = [], []
|
||||
for k in shapes:
|
||||
r = by_shape[k]
|
||||
M, K, N = r["M"], r["K"], r["N"]
|
||||
useful = 2 * M * K * N
|
||||
tiles = r["tile_count_expected"]
|
||||
gu = useful / (tile_flops * tiles) * 100
|
||||
gemm_util.append(gu)
|
||||
m_tiles = (M + TILE_M - 1) // TILE_M
|
||||
n_tiles = (N + TILE_N - 1) // TILE_N
|
||||
n_mn = m_tiles * n_tiles
|
||||
compute_total = tiles * _T_STAGE
|
||||
wall = head_ns + tiles * _T_STAGE + max(0, n_mn - 1) * dma_w_per_pair
|
||||
ueff = (compute_total * (gu / 100.0) / wall) * 100 if wall > 0 else 0.0
|
||||
useful_eff.append(ueff)
|
||||
|
||||
series = {"GEMM util %": gemm_util, "Useful eff %": useful_eff}
|
||||
colors = {"GEMM util %": "#10B981", "Useful eff %": "#F59E0B"}
|
||||
return _grouped_bar_png(
|
||||
"gemm_mac_utilization_measured.png",
|
||||
title="GEMM MAC utilization — load_ref",
|
||||
subtitle=("GEMM util = useful FLOPs / (tile FLOPs x tiles); "
|
||||
"Useful eff = GEMM util x ideal pipeline efficiency."),
|
||||
shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
|
||||
series=series, colors=colors, y_label="%", threshold=100.0,
|
||||
footnote="Theoretical ideal-pipeline model (not simulator data).",
|
||||
)
|
||||
|
||||
|
||||
def emit_mac_utilization_theoretical_vs_measured() -> str | None:
|
||||
"""Theoretical vs simulator-measured GEMM util / useful eff (load_ref)."""
|
||||
data = _load_sweep_data()
|
||||
rows = data["rows"]
|
||||
if not rows:
|
||||
return None
|
||||
tile = data["tile_sizes"]
|
||||
TILE_M, TILE_K, TILE_N = tile["M"], tile["K"], tile["N"]
|
||||
tile_flops = 2 * TILE_M * TILE_K * TILE_N
|
||||
dma_w_per_pair = (TILE_M * TILE_N * _BPE) / _HBM_GBS
|
||||
head_ns = (_D_STAGES - 1) * _T_STAGE
|
||||
peak_per_ns = tile_flops / _T_STAGE
|
||||
|
||||
by_shape = {(r["M"], r["K"], r["N"]): r
|
||||
for r in rows if r["variant"] == _PLOT_VARIANT}
|
||||
shapes = list(by_shape)
|
||||
if not shapes:
|
||||
return None
|
||||
shape_labels = [_shape_label(by_shape[k]) for k in shapes]
|
||||
flagged = [_under_tile(*k, TILE_M, TILE_K, TILE_N) for k in shapes]
|
||||
tile_counts = [by_shape[k]["tile_count_expected"] for k in shapes]
|
||||
|
||||
gu_t, gu_m, eff_t, eff_m = [], [], [], []
|
||||
for k in shapes:
|
||||
r = by_shape[k]
|
||||
M, K, N = r["M"], r["K"], r["N"]
|
||||
useful = 2 * M * K * N
|
||||
tiles = r["tile_count_expected"]
|
||||
gut = useful / (tile_flops * tiles)
|
||||
gu_t.append(gut * 100)
|
||||
rec = r.get("stages", {}).get("GEMM", {}).get("record_count", 0) or tiles
|
||||
gu_m.append((useful / (tile_flops * rec) * 100) if rec else 0.0)
|
||||
m_tiles = (M + TILE_M - 1) // TILE_M
|
||||
n_tiles = (N + TILE_N - 1) // TILE_N
|
||||
n_mn = m_tiles * n_tiles
|
||||
compute_total = tiles * _T_STAGE
|
||||
wall_t = head_ns + compute_total + max(0, n_mn - 1) * dma_w_per_pair
|
||||
eff_t.append((compute_total * gut / wall_t * 100) if wall_t > 0 else 0.0)
|
||||
cw = r.get("composite_window_ns", 0.0) or 0.0
|
||||
eff_m.append((useful / cw / peak_per_ns * 100) if cw > 0 else 0.0)
|
||||
|
||||
series = {
|
||||
"GEMM util % (theoretical)": gu_t,
|
||||
"GEMM util % (measured)": gu_m,
|
||||
"Theoretical eff %": eff_t,
|
||||
"Measured eff %": eff_m,
|
||||
}
|
||||
colors = {
|
||||
"GEMM util % (theoretical)": "#10B981",
|
||||
"GEMM util % (measured)": "#6EE7B7",
|
||||
"Theoretical eff %": "#F59E0B",
|
||||
"Measured eff %": "#3B82F6",
|
||||
}
|
||||
return _grouped_bar_png(
|
||||
"gemm_mac_utilization_theoretical_vs_measured.png",
|
||||
title="GEMM MAC utilization — theoretical vs measured (load_ref)",
|
||||
subtitle=("theoretical model vs simulator op_log; agreement "
|
||||
"validates the analytical pipeline model."),
|
||||
shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
|
||||
series=series, colors=colors, y_label="%", threshold=100.0,
|
||||
)
|
||||
|
||||
|
||||
def emit_all_gemm_plots() -> list[str]:
|
||||
"""Render every GEMM figure that has data; return the list of paths written."""
|
||||
paths = []
|
||||
for fn in (emit_stage_breakdown,
|
||||
emit_mac_utilization_measured,
|
||||
emit_mac_utilization_theoretical_vs_measured):
|
||||
p = fn()
|
||||
if p:
|
||||
paths.append(p)
|
||||
return paths
|
||||
__all__ = [
|
||||
"GEMM_PLOTS_DIR",
|
||||
"GEMM_SWEEP_JSON",
|
||||
"ROOT",
|
||||
"emit_all_gemm_plots",
|
||||
"emit_mac_utilization_measured",
|
||||
"emit_mac_utilization_theoretical_vs_measured",
|
||||
"emit_stage_breakdown",
|
||||
]
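For readers who want the arithmetic behind the MAC-utilization charts without opening the bench module, the analytical model from the renderer shown in this diff can be written as one function. This is a sketch under stated assumptions: `mac_utilization` is a hypothetical name, the constants are copied from the renderer, and the tile count is computed as the product of the per-axis ceilings, whereas the renderer reads `tile_count_expected` from the sweep JSON.

```python
HBM_GBS = 256.0   # bytes per ns, numerically equal to GB/s
BPE = 2           # bytes per element
T_STAGE = 16.0    # ns per pipeline stage
D_STAGES = 3      # pipeline depth


def _ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b


def mac_utilization(M, K, N, tile_m, tile_k, tile_n):
    """Return (gemm_util_pct, useful_eff_pct) for one GEMM shape."""
    m_tiles = _ceil_div(M, tile_m)
    k_tiles = _ceil_div(K, tile_k)
    n_tiles = _ceil_div(N, tile_n)
    tiles = m_tiles * k_tiles * n_tiles      # assumption, see lead-in

    tile_flops = 2 * tile_m * tile_k * tile_n
    useful = 2 * M * K * N
    gemm_util = useful / (tile_flops * tiles)

    head_ns = (D_STAGES - 1) * T_STAGE       # pipeline fill
    dma_w_per_pair = (tile_m * tile_n * BPE) / HBM_GBS
    compute_total = tiles * T_STAGE
    wall = head_ns + compute_total + max(0, m_tiles * n_tiles - 1) * dma_w_per_pair
    useful_eff = compute_total * gemm_util / wall if wall > 0 else 0.0
    return gemm_util * 100.0, useful_eff * 100.0


# Hypothetical 32x32x32 tile; the real tile sizes come from the sweep JSON.
full = mac_utilization(64, 64, 64, 32, 32, 32)        # tile-aligned shape
under = mac_utilization(16, 32, 32, 32, 32, 32)       # under-tile in M
```

With the hypothetical 32x32x32 tile, the aligned 64-cube shape reaches 100% GEMM utilization but only about 69.6% useful efficiency, because pipeline fill and output write-back add 56 ns to 128 ns of compute. The under-tile shape wastes half of its single tile.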
|
||||
|
||||
@@ -0,0 +1,131 @@
|
"""Phase 1 spec test for ``ctx.launch`` dim-translation bugs surfaced by
the multi_user_* panels of milestone-gqa-llama70b (sub-cycle 4c step 2).

The default ``topology.yaml`` has 4×4 = 16 cubes per SIP, so
``RuntimeContext._num_cubes == 16``. Multi-user attention panels run a
4-cube ring (validation scale) by passing ``DPPolicy(num_cubes=4)``.

Two bugs in ``ctx.launch`` make this combination silently produce wrong
kernel arguments:

Bug A: _compute_local_shape ignores DPPolicy.num_cubes
    ``_compute_local_shape`` in ``ctx.launch`` divides by
    ``self._num_cubes`` (the topology's cube count, 16) instead of the
    DPPolicy's effective ``num_cubes`` (4). So a ``(M=80, K=64)`` tensor
    sharded ``cube="row_wise"`` with ``DPPolicy(num_cubes=4)`` produces
    a local M of ``80 // 16 = 5``, not the kernel-expected ``80 // 4 = 20``.
    Note: tensor allocation already honors ``dp.num_cubes`` correctly at
    [context.py:471-484](src/kernbench/runtime_api/context.py#L471-L484);
    the bug is that the parallel computation inside ``launch`` is out of
    sync with it.

Bug B: scalar args coincidentally equal to a global tensor dim get auto-remapped
    The dim_map at [context.py:712-770](src/kernbench/runtime_api/context.py#L712-L770)
    is keyed by *value*, so any scalar whose value coincides with a
    global tensor dim gets rewritten to that dim's local value, even
    when the scalar is unrelated. ``d_head=64`` coincides with the
    multi_user K's global M = ``S_kv_per_rank * n = 16 * 4 = 64``, so
    the kernel receives ``d_head = 16`` (the post-Bug-A local, 64 // 4) or
    ``d_head = 4`` (the pre-Bug-A local, 64 // 16) instead of ``64``.

    Legacy bench kernels rely on auto-remap (e.g. ``test_va_offset.py``
    passes global N and expects the kernel to see local N). The fix is
    opt-out, not removal: ``ctx.launch(..., _auto_dim_remap=False)``
    preserves scalars exactly as passed; default behavior is unchanged.

Both bug tests fail today. Phase 2 fixes them in [src/kernbench/runtime_api/context.py](src/kernbench/runtime_api/context.py).
"""
|
||||
from __future__ import annotations
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
from kernbench.policy.placement.dp import DPPolicy
|
||||
from kernbench.runtime_api.context import RuntimeContext
|
||||
from kernbench.runtime_api.types import DeviceSelector
|
||||
from kernbench.sim_engine.engine import GraphEngine
|
||||
from kernbench.topology.builder import load_topology
|
||||
|
||||
TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
|
||||
|
||||
|
||||
def _make_ctx(corr_id: str) -> RuntimeContext:
|
||||
graph = load_topology(TOPOLOGY_PATH)
|
||||
engine = GraphEngine(graph)
|
||||
return RuntimeContext(
|
||||
engine=engine, target_device=DeviceSelector("sip:0"),
|
||||
correlation_id=corr_id, spec=graph.spec,
|
||||
)
|
||||
|
||||
|
||||
def test_topology_num_cubes_is_16_baseline_assumption():
|
||||
"""Sanity: confirm the topology this test assumes (16 cubes per SIP).
|
||||
If this fails, recheck the topology.yaml cube_mesh setting before
|
||||
interpreting the other failures below. ``_num_cubes`` is initialized
|
||||
lazily by ``_ensure_allocators`` on first tensor op, so trigger it."""
|
||||
ctx = _make_ctx("dim-baseline")
|
||||
ctx._ensure_allocators()
|
||||
assert ctx._num_cubes == 16, (
|
||||
f"expected default topology.yaml to give 16 cubes per SIP, "
|
||||
f"got {ctx._num_cubes}"
|
||||
)
|
||||
|
||||
|
||||
def test_ctx_launch_local_shape_honors_dppolicy_num_cubes():
    """Bug A. ``DPPolicy(num_cubes=4)`` must be the divisor for
    row_wise sharding inside ctx.launch's dim_map, not the topology's 16.

    Setup: K-like tensor with M_global = 80 (cleanly divisible by both
    4 and 16, distinct local values 20 vs 5). Pass M_global as a kernel
    scalar; the kernel records what it received. With a correct dim_map,
    scalar 80 is remapped to 20 (80 / dp.num_cubes, where dp.num_cubes
    is 4). With current code, it is remapped to 5 (80 / self._num_cubes,
    where self._num_cubes is 16).
    """
    captured: dict[str, int] = {}

    def _kernel(t, m_scalar, *, tl):  # noqa: ARG001
        captured["m_scalar"] = int(m_scalar)

    ctx = _make_ctx("dim-bugA")
    dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=4, num_pes=8)
    t = ctx.zeros((80, 64), dtype="f16", dp=dp, name="t80x64")
    ctx.launch("bugA_capture", _kernel, t, 80)
    ctx.wait_all()

    assert "m_scalar" in captured, "kernel was not invoked"
    assert captured["m_scalar"] == 20, (
        f"expected dim_map to divide 80 by dp.num_cubes=4 → 20; "
        f"got {captured['m_scalar']} (likely divided by topology cubes=16)"
    )
|
||||
|
||||
|
||||
def test_ctx_launch_scalar_passed_through_when_auto_remap_disabled():
|
||||
"""Bug B. Scalars must not be silently remapped when their value
|
||||
happens to equal a tensor's global dim — at minimum the caller must
|
||||
have an opt-out.
|
||||
|
||||
Setup: K-like tensor with M_global = 64 row_wise. Pass d_head = 64
|
||||
as a scalar (semantically unrelated to K's M, but coincidentally
|
||||
equal). The kernel records d_head. With ``_auto_dim_remap=False``
|
||||
on ctx.launch, d_head must stay 64.
|
||||
|
||||
Today: ``_auto_dim_remap`` kwarg doesn't exist → TypeError. After
|
||||
Phase 2: kwarg exists, defaults to True (legacy unchanged); passing
|
||||
False preserves the scalar.
|
||||
"""
|
||||
captured: dict[str, int] = {}
|
||||
|
||||
def _kernel(t, d_head, *, tl): # noqa: ARG001
|
||||
captured["d_head"] = int(d_head)
|
||||
|
||||
ctx = _make_ctx("dim-bugB")
|
||||
dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=4, num_pes=8)
|
||||
t = ctx.zeros((64, 64), dtype="f16", dp=dp, name="t64x64")
|
||||
ctx.launch(
|
||||
"bugB_capture", _kernel, t, 64,
|
||||
_auto_dim_remap=False,
|
||||
)
|
||||
ctx.wait_all()
|
||||
|
||||
assert captured.get("d_head") == 64, (
|
||||
f"expected d_head scalar to pass through unchanged when "
|
||||
f"_auto_dim_remap=False; got {captured.get('d_head')!r}"
|
||||
)
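Both bugs come down to a lookup table keyed by value. The following sketch reproduces them in isolation. It is not the code in `context.py`: `build_dim_map` and `remap_scalars` are hypothetical helpers, and only the row dimension of a `row_wise`-sharded tensor is modeled.

```python
def build_dim_map(global_shape, num_cubes):
    """Map a tensor's global row count to its per-cube (local) row count."""
    rows = global_shape[0]
    return {rows: rows // num_cubes}


def remap_scalars(scalars, dim_map, auto_dim_remap=True):
    """Rewrite any scalar whose VALUE matches a global dim in dim_map."""
    if not auto_dim_remap:
        return list(scalars)  # opt-out: pass scalars through untouched
    return [dim_map.get(s, s) for s in scalars]


# Bug A: dividing by the topology's 16 cubes instead of the policy's 4.
wrong_map = build_dim_map((80, 64), num_cubes=16)
right_map = build_dim_map((80, 64), num_cubes=4)
bug_a_wrong = remap_scalars([80], wrong_map)
bug_a_right = remap_scalars([80], right_map)

# Bug B: d_head = 64 is unrelated to the tensor's row count, but the
# values coincide, so the value-keyed lookup rewrites it anyway.
k_map = build_dim_map((64, 64), num_cubes=4)
bug_b_remapped = remap_scalars([64], k_map)
bug_b_opt_out = remap_scalars([64], k_map, auto_dim_remap=False)
```

A value-keyed map cannot tell a dimension from an unrelated scalar with the same value, which is why the tests ask for an opt-out instead of a smarter lookup: legacy callers that depend on the remap keep working.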
|
||||
@@ -0,0 +1,77 @@
|
"""Milestone benches: registration + figure/result generation (ADR-0054).

``milestone-1h-gemm`` / ``milestone-1h-ccl`` are eval benches: run via the
normal ``run_bench`` path, they regenerate every GEMM / allreduce figure +
CSV into ``benches/1H_milestone_output/{gemm,ccl}/``. The GEMM bench in
``MILESTONE_FAST=1`` mode just re-renders the committed sweep JSON (fast,
default-run here); the CCL bench drives both full sweeps (slow, opt-in).
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import re
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
from kernbench.benches.registry import resolve
|
||||
from kernbench.runtime_api.bench_runner import run_bench
|
||||
from kernbench.runtime_api.types import resolve_device
|
||||
from kernbench.sim_engine.engine import GraphEngine
|
||||
from kernbench.topology.builder import resolve_topology
|
||||
|
||||
import kernbench.benches.milestone_1h_ccl as ccl_bench
|
||||
import kernbench.benches.milestone_1h_gemm as gemm_bench
|
||||
|
||||
_NAME_RE = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")
|
||||
|
||||
|
||||
def _run(name: str):
|
||||
topo = resolve_topology("topology.yaml")
|
||||
return run_bench(
|
||||
topology=topo, bench_fn=resolve(name).run, device=resolve_device(None),
|
||||
engine_factory=lambda t, d: GraphEngine(
|
||||
getattr(t, "topology_obj", t), enable_data=True,
|
||||
),
|
||||
)
|
||||
|
||||
|
||||
def test_milestone_benches_registered():
|
||||
for name in ("milestone-1h-gemm", "milestone-1h-ccl"):
|
||||
spec = resolve(name)
|
||||
assert spec.name == name
|
||||
assert _NAME_RE.match(spec.name)
|
||||
assert spec.description.strip()
|
||||
|
||||
|
||||
@pytest.mark.skipif(
|
||||
not gemm_bench.DEFAULT_SWEEP_JSON.exists(),
|
||||
reason="gemm_sweep.json absent; run scripts/gemm_sweep.py first",
|
||||
)
|
||||
def test_milestone_gemm_fast_generates_figures(monkeypatch):
|
||||
monkeypatch.setenv("MILESTONE_FAST", "1")
|
||||
result = _run("milestone-1h-gemm")
|
||||
assert result.completion.ok, result.completion
|
||||
out = gemm_bench._OUTPUT_DIR
|
||||
for png in (
|
||||
"gemm_stage_breakdown.png",
|
||||
"gemm_mac_utilization_measured.png",
|
||||
"gemm_mac_utilization_theoretical_vs_measured.png",
|
||||
):
|
||||
assert (out / png).exists(), f"missing {png}"
|
||||
|
||||
|
||||
@pytest.mark.slow
|
||||
def test_milestone_ccl_generates_figures():
|
||||
result = _run("milestone-1h-ccl")
|
||||
assert result.completion.ok, result.completion
|
||||
out = ccl_bench._OUTPUT_DIR
|
||||
for artifact in (
|
||||
"summary.csv",
|
||||
"topology.png",
|
||||
"comparison_mesh_vs_ring_vs_2DTorus_vs_theoretical_vs_fsim.png",
|
||||
"AllReduce_LRAB_2Dtorus_6SiP_2x3_with_TCM_SRAM_HBM.png",
|
||||
"AllReduce_LRAB_Ring1D_6SiP_1x6.png",
|
||||
"AllReduce_LRAB_2Dtorus_6SiP_2x3.png",
|
||||
"AllReduce_LRAB_2DMesh_6SiP_2x3.png",
|
||||
):
|
||||
assert (out / artifact).exists(), f"missing {artifact}"
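The registration test requires bench names to match `_NAME_RE`. As a quick illustration of what that pattern accepts, the sketch below runs it over a few candidates. Only the two `milestone-1h-*` names come from this diff; the rejected names are made up for the example.

```python
import re

# Same pattern as _NAME_RE above: lowercase kebab-case, starting with a
# letter, with no empty segments between hyphens.
NAME_RE = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")

candidates = [
    "milestone-1h-gemm",   # registered name from this diff
    "milestone-1h-ccl",    # registered name from this diff
    "Milestone-1h-gemm",   # made up: uppercase letter
    "1h-gemm",             # made up: starts with a digit
    "milestone--gemm",     # made up: empty segment
    "milestone_1h_gemm",   # made up: underscores (module style, not bench name)
]

accepted = [name for name in candidates if NAME_RE.match(name)]
```

The module names (`milestone_1h_gemm`) use underscores, while the registered bench names use hyphens, so the two are not interchangeable.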
|
||||
@@ -0,0 +1,218 @@
|
"""Phase 1 spec test for the math-input snapshot race (IPCQ slot wrap).

Context (sub-cycle 4c.0 diagnostic):

The mesh decode kernel (_attention_mesh_mlo.py) issues many tl.recv()
calls against an IPCQ ring of ~8 slots. With n_ranks=8 and bidirectional
fan-out, each PE issues 3 recvs per step × 7 steps × 2 directions =
42 recvs per panel. The IPCQ slot index is ``my_tail % n_slots``, so
the ring wraps and a fresh recv overwrites a slot whose data a prior
math op had not yet snapshotted.

OpLogger.record_end currently snapshots math inputs by re-reading
MemoryStore at record_end time (op_log.py:97-113). When a later recv
has overwritten the input addr with a DIFFERENT-shape array between
record_start and record_end, MemoryStore.read raises
``Shape mismatch: stored (16, 64) vs requested (16, 1)`` and the
snapshot becomes None (or, in Phase 2 replay, surfaces the same
exception in DataExecutor).

Phase 1 expectation: the slot-overwrite tests below currently fail.
They assert the *desired* behavior: when the math input TensorHandle
carries a .data snapshot (captured at recv time before the slot was
wrapped), OpLogger MUST prefer that snapshot over MemoryStore.read.

After Phase 2 (snapshot propagation fix), these tests pass, and the
sub-cycle 4c.0 mesh decode end-to-end (test_attention_mesh_decode_diag
and test_milestone_gqa_llama70b) passes for the same reason.

See: docs/adr/ADR-0020 (two-phase execution),
     docs/adr/ADR-0023 (IPCQ ring slots),
     docs/adr/ADR-0027 (snapshot discipline for dma_write).
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import numpy as np
|
||||
|
||||
from kernbench.common.pe_commands import MathCmd, TensorHandle
|
||||
from kernbench.sim_engine.memory_store import MemoryStore
|
||||
from kernbench.sim_engine.op_log import OpLogger
|
||||
|
||||
|
||||
# ── Helpers ──────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def _slot_handle(addr: int, shape: tuple[int, ...], dtype: str,
|
||||
data: np.ndarray | None) -> TensorHandle:
|
||||
"""Build a TensorHandle as tl.recv() would: addr=slot, .data=snapshot."""
|
||||
nbytes = int(np.prod(shape)) * np.dtype(
|
||||
{"f16": np.float16, "f32": np.float32}[dtype]
|
||||
).itemsize
|
||||
return TensorHandle(
|
||||
id=f"slot_{addr:x}", addr=addr, shape=shape, dtype=dtype,
|
||||
nbytes=nbytes, data=data, space="tcm",
|
||||
)
|
||||
|
||||
|
||||
def _out_handle(addr: int, shape: tuple[int, ...], dtype: str) -> TensorHandle:
|
||||
nbytes = int(np.prod(shape)) * np.dtype(
|
||||
{"f16": np.float16, "f32": np.float32}[dtype]
|
||||
).itemsize
|
||||
return TensorHandle(
|
||||
id=f"out_{addr:x}", addr=addr, shape=shape, dtype=dtype,
|
||||
nbytes=nbytes, data=None, space="tcm",
|
||||
)
|
||||
|
||||
|
||||
# ── Tests ─────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def test_math_snapshot_lost_when_input_slot_overwritten_with_same_nbytes():
    """Quiet variant of the bug: if a later write at the input addr has the
    SAME nbytes as the math input's expected shape, MemoryStore.read does
    not raise and returns the LATER data, so the snapshot is silently
    wrong and produces incorrect numerical output in Phase 2.

    The assertion below encodes the desired post-fix behavior, so it
    fails today. It documents that the current OpLogger behavior is wrong
    even when shapes coincidentally match. The Phase 2 fix removes this
    silent-corruption mode by preferring handle.data.
    """
    store = MemoryStore()
    slot_addr = 0x3000
    # Original at recv time: filled with 7s.
    original = np.full((16, 1), 7.0, dtype=np.float16)
    store.write("tcm", slot_addr, original)

    inp = _slot_handle(slot_addr, (16, 1), "f16", data=original.copy())
    out = _out_handle(0x4000, (16, 1), "f16")
    cmd = MathCmd(op="maximum", inputs=(inp,), out=out)

    logger = OpLogger(memory_store=store)
    logger.record_start(10.0, "sip0.cube0.pe0.pe_math", cmd)

    # SIMULATE: a later recv writes a DIFFERENT array at the same slot
    # (same nbytes as (16,1), so MemoryStore.read does not raise).
    later = np.full((16, 1), 99.0, dtype=np.float16)
    store.write("tcm", slot_addr, later)

    logger.record_end(15.0, "sip0.cube0.pe0.pe_math", cmd)

    snap = logger.records[0].params["input_snapshots"][0]
    assert snap is not None
    # Desired post-fix behavior: snapshot equals ``original``.
    # Today: snapshot equals ``later``, i.e. silent corruption.
    np.testing.assert_array_equal(snap, original)
|
||||
|
||||
|
||||
def test_math_snapshot_survives_input_slot_wrap_with_different_shape():
|
||||
"""The hard-failure variant: a later recv overwrites the input slot
|
||||
with a DIFFERENT-shape array (different nbytes), so MemoryStore.read
|
||||
at record_end raises and the snapshot becomes None. Phase 2 replay
|
||||
then surfaces this as the (16, 64) vs (16, 1) crash seen in
|
||||
test_attention_mesh_decode_diag.
|
||||
|
||||
Desired behavior: handle.data carries the recv-time snapshot, so
|
||||
OpLogger never has to look at MemoryStore for this input → no race,
|
||||
snapshot is correct.
|
||||
"""
|
||||
store = MemoryStore()
|
||||
slot_addr = 0x3000
|
||||
|
||||
# Original at recv time: an (m, ℓ) reduction result, shape (16, 1).
|
||||
original = np.full((16, 1), 7.0, dtype=np.float16)
|
||||
store.write("tcm", slot_addr, original)
|
||||
|
||||
inp = _slot_handle(slot_addr, (16, 1), "f16", data=original.copy())
|
||||
out = _out_handle(0x4000, (16, 1), "f16")
|
||||
cmd = MathCmd(op="maximum", inputs=(inp,), out=out)
|
||||
|
||||
logger = OpLogger(memory_store=store)
|
||||
logger.record_start(10.0, "sip0.cube0.pe0.pe_math", cmd)
|
||||
|
||||
# SIMULATE the slot-wrap race: a later recv (an o triplet, shape
|
||||
# (16, 64)) writes the same TCM slot. MemoryStore.read for shape
|
||||
# (16, 1) now raises ValueError("Shape mismatch ...").
|
||||
overwrite = np.full((16, 64), 99.0, dtype=np.float16)
|
||||
store.write("tcm", slot_addr, overwrite)
|
||||
|
||||
logger.record_end(15.0, "sip0.cube0.pe0.pe_math", cmd)
|
||||
|
||||
snap = logger.records[0].params["input_snapshots"][0]
|
||||
# Today: snap is None (read raised, except branch returned None).
|
||||
# Post-fix: handle.data preferred → snap is original.
|
||||
assert snap is not None, (
|
||||
"input snapshot was lost when the recv slot was wrapped — "
|
||||
"OpLogger must prefer handle.data over MemoryStore.read for "
|
||||
"math inputs whose handle carries a .data snapshot"
|
||||
)
|
||||
assert snap.shape == (16, 1)
|
||||
np.testing.assert_array_equal(snap, original)
|
||||
|
||||
|
||||
def test_math_snapshot_handle_data_with_multiple_inputs():
|
||||
"""maximum/binary math has 2 inputs; both must use their carried
|
||||
snapshots independently (e.g. m_running merged with m_from_W where
|
||||
only m_from_W came from a recv slot)."""
|
||||
store = MemoryStore()
|
||||
|
||||
# Input 0: a running m value held in PE scratch (no .data; OpLogger
|
||||
# falls back to MemoryStore.read as today). Its addr is stable —
|
||||
# not subject to the slot-wrap race.
|
||||
scratch_addr = 0x5000
|
||||
m_running = np.full((16, 1), 3.0, dtype=np.float16)
|
||||
store.write("tcm", scratch_addr, m_running)
|
||||
inp0 = _slot_handle(scratch_addr, (16, 1), "f16", data=None)
|
||||
|
||||
# Input 1: m_from_W via tl.recv — carries snapshot in .data, addr
|
||||
# is the recv slot which WILL be wrapped before record_end.
|
||||
slot_addr = 0x3000
|
||||
    m_from_W = np.full((16, 1), 7.0, dtype=np.float16)
    store.write("tcm", slot_addr, m_from_W)
    inp1 = _slot_handle(slot_addr, (16, 1), "f16", data=m_from_W.copy())

    out = _out_handle(0x4000, (16, 1), "f16")
    cmd = MathCmd(op="maximum", inputs=(inp0, inp1), out=out)

    logger = OpLogger(memory_store=store)
    logger.record_start(10.0, "sip0.cube0.pe0.pe_math", cmd)

    # Slot 0x3000 gets wrapped by a later recv with a different shape.
    overwrite = np.full((16, 64), 99.0, dtype=np.float16)
    store.write("tcm", slot_addr, overwrite)

    logger.record_end(15.0, "sip0.cube0.pe0.pe_math", cmd)

    snaps = logger.records[0].params["input_snapshots"]
    assert len(snaps) == 2
    # Input 0 (no carried snapshot, addr stable): MemoryStore read still
    # works. This must keep working post-fix.
    assert snaps[0] is not None
    np.testing.assert_array_equal(snaps[0], m_running)
    # Input 1 (carried snapshot, slot wrapped): must come from .data.
    assert snaps[1] is not None
    assert snaps[1].shape == (16, 1)
    np.testing.assert_array_equal(snaps[1], m_from_W)


def test_math_snapshot_falls_back_to_memory_store_when_handle_data_is_none():
    """Backward-compat: handles with .data=None must continue to use
    MemoryStore.read as today. Most math inputs (intermediate results
    from local tl.dot / tl.exp etc.) have data=None and their TCM addrs
    are stable for the kernel's lifetime."""
    store = MemoryStore()
    addr = 0x6000
    arr = np.full((8, 8), 2.0, dtype=np.float16)
    store.write("tcm", addr, arr)

    inp = _slot_handle(addr, (8, 8), "f16", data=None)
    out = _out_handle(0x7000, (8, 8), "f16")
    cmd = MathCmd(op="exp", inputs=(inp,), out=out)

    logger = OpLogger(memory_store=store)
    logger.record_start(10.0, "sip0.cube0.pe0.pe_math", cmd)
    logger.record_end(15.0, "sip0.cube0.pe0.pe_math", cmd)

    snap = logger.records[0].params["input_snapshots"][0]
    assert snap is not None
    np.testing.assert_array_equal(snap, arr)
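The two tests above pin down one selection rule: an input that carries its own `.data` snapshot is logged from that snapshot, and an input with `.data=None` is logged from a memory-store read. The sketch below illustrates that rule in isolation. `Handle`, `TinyStore`, and `snapshot_input` are hypothetical stand-ins written for this example and are not the project's `OpLogger` or `MemoryStore`; plain lists replace NumPy arrays to keep the sketch dependency-free.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Handle:
    """Hypothetical stand-in for a tensor handle (not the project's class)."""

    addr: int
    data: list[float] | None = None  # snapshot carried with the handle, if any


class TinyStore:
    """Hypothetical stand-in for a memory store: address -> current contents."""

    def __init__(self) -> None:
        self._mem: dict[int, list[float]] = {}

    def write(self, addr: int, values: list[float]) -> None:
        self._mem[addr] = list(values)

    def read(self, addr: int) -> list[float] | None:
        values = self._mem.get(addr)
        return None if values is None else list(values)


def snapshot_input(handle: Handle, store: TinyStore) -> list[float] | None:
    """Prefer the carried snapshot; otherwise fall back to a store read."""
    if handle.data is not None:
        return list(handle.data)
    return store.read(handle.addr)


store = TinyStore()
store.write(0x3000, [7.0] * 4)
store.write(0x6000, [2.0] * 4)
carried = Handle(addr=0x3000, data=[7.0] * 4)  # input that carries a snapshot
plain = Handle(addr=0x6000)                    # data=None, address stays stable

# The slot is reused with different contents before the snapshot is taken.
store.write(0x3000, [99.0] * 8)

carried_snapshot = snapshot_input(carried, store)
plain_snapshot = snapshot_input(plain, store)
```

Under these assumptions the carried input still yields its original values after the slot is overwritten, while the plain input keeps reading from the store as before.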
@@ -0,0 +1,334 @@
"""Generate docs/adr/INDEX.md (and docs/adr-ko/INDEX.md) from the ADR corpus.

Auto-derives a section-based index following the same classification as
the /report skill — Design Principles / High-level Architecture /
Detailed Architecture (by component) / Implementation Decisions
(by topic). Run before publishing to refresh INDEX.md.

The classification table below is the single source of truth. When a new
ADR is added under docs/adr/, append an entry to ``CLASSIFICATION``. The
script exits 1 if any ADR file is missing from the table or any title
cannot be parsed, so omissions surface in CI.

Usage:
    python tools/generate_adr_index.py [--root <repo-root>] [--check]

    --check : exit 1 if the generated INDEX differs from the on-disk file
              (used by CI to detect un-regenerated indexes).
"""

from __future__ import annotations

import argparse
import re
import sys
from pathlib import Path

ADR_FILENAME_RE = re.compile(r"^ADR-(\d{4})-([a-z0-9_-]+)\.md$")
# Title separator may be ":" (most ADRs) or "—" (em-dash; ADR-0033 uses
# this). The verifier (tools/verify_adr_lang_pairs.py) only checks the
# number, so both styles already coexist in the corpus.
TITLE_RE = re.compile(r"^# ADR-(\d{4})\s*[:—]\s*(.+?)\s*$")

DESIGN_PRINCIPLES = "Design Principles"
HIGH_LEVEL = "High-level Architecture"
DETAILED = "Detailed Architecture"
IMPL_DECISIONS = "Implementation Decisions"


# (section, subgroup) per ADR. subgroup is used to sub-divide Detailed
# (by component, see DETAILED_COMPONENTS) and Implementation (by topic).
# Add a line here when introducing a new ADR.
CLASSIFICATION: dict[int, tuple[str, str | None]] = {
    # Design Principles
    13: (DESIGN_PRINCIPLES, None),
    33: (DESIGN_PRINCIPLES, None),

    # High-level Architecture
    3: (HIGH_LEVEL, "System hierarchy (Tray / SIP / CUBE / PE)"),
    7: (HIGH_LEVEL, "Runtime API ↔ sim_engine boundaries"),
    16: (HIGH_LEVEL, "IOChiplet NOC and memory data path"),
    17: (HIGH_LEVEL, "Cube NOC and HBM connectivity"),

    # Detailed Architecture (subgroup matches DETAILED_COMPONENTS entries)
    14: (DETAILED, "pe_pipeline"),  # covers pe_cpu/pe_dma/pe_fetch_store/pe_gemm/pe_math/pe_scheduler
    23: (DETAILED, "pe_ipcq"),
    34: (DETAILED, "hbm_ctrl"),
    35: (DETAILED, "m_cpu"),
    36: (DETAILED, "io_cpu"),
    37: (DETAILED, "forwarding"),
    38: (DETAILED, "pcie_ep"),
    39: (DETAILED, "pe_mmu"),
    40: (DETAILED, "pe_tcm"),
    41: (DETAILED, "sram"),
    42: (DETAILED, "tiling"),

    # Implementation Decisions
    1: (IMPL_DECISIONS, "Address Scheme"),
    2: (IMPL_DECISIONS, "Routing & Helper API"),
    4: (IMPL_DECISIONS, "Memory Semantics & Local-HBM Bandwidth"),
    5: (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
    6: (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
    8: (IMPL_DECISIONS, "Tensor Deployment and Allocation"),
    9: (IMPL_DECISIONS, "Kernel Execution and Host-Device Messaging"),
    10: (IMPL_DECISIONS, "CLI Surface and Semantics"),
    11: (IMPL_DECISIONS, "Address Scheme"),
    12: (IMPL_DECISIONS, "Kernel Execution and Host-Device Messaging"),
    15: (IMPL_DECISIONS, "Component Port/Wire Fabric Model"),
    20: (IMPL_DECISIONS, "Two-Pass Data Execution"),
    22: (IMPL_DECISIONS, "2D Grid Program Identity"),
    24: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    25: (IMPL_DECISIONS, "IPCQ Direction Addressing"),
    26: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    27: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    32: (IMPL_DECISIONS, "Intercube All-Reduce"),
    43: (IMPL_DECISIONS, "Evaluation Harnesses"),
    44: (IMPL_DECISIONS, "Evaluation Harnesses"),
    45: (IMPL_DECISIONS, "Bench Module Contract"),
    46: (IMPL_DECISIONS, "Kernel-side tl.* API (TLContext)"),
    47: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    48: (IMPL_DECISIONS, "Memory Allocator Algorithms"),
    49: (IMPL_DECISIONS, "Probe Subcommand"),
    50: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    51: (IMPL_DECISIONS, "Routing & Helper API"),
    52: (IMPL_DECISIONS, "Sim-engine Op Log and Memory Store Schemas"),
    53: (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
    54: (IMPL_DECISIONS, "Evaluation Harnesses"),
}

# Canonical component order for the Detailed Architecture section.
# Each entry: (component_name, list[ADR-numbers that cover it]).
# Order matches src/kernbench/components/builtin/*.py alphabetical
# (the same order /report uses).
DETAILED_COMPONENTS: list[tuple[str, list[int]]] = [
    ("forwarding", [37]),
    ("hbm_ctrl", [34]),
    ("io_cpu", [36]),
    ("m_cpu", [35]),
    ("pcie_ep", [38]),
    ("pe_cpu", [14]),
    ("pe_dma", [14, 23]),
    ("pe_fetch_store", [14]),
    ("pe_gemm", [14]),
    ("pe_ipcq", [23]),
    ("pe_math", [14]),
    ("pe_mmu", [39]),
    ("pe_scheduler", [14]),
    ("pe_tcm", [40]),
    ("sram", [41]),
    ("tiling", [42]),
]


def _strip_bom(text: str) -> str:
    """Strip leading UTF-8 BOM if present."""
    if text and ord(text[0]) == 0xFEFF:
        return text[1:]
    return text


def _find_adrs(adr_dir: Path) -> list[tuple[int, str, Path]]:
    """Return [(num, slug, path), ...] for ADR files in adr_dir, sorted by num."""
    out: list[tuple[int, str, Path]] = []
    for p in sorted(adr_dir.iterdir()):
        if not p.is_file():
            continue
        m = ADR_FILENAME_RE.match(p.name)
        if not m:
            continue
        out.append((int(m.group(1)), m.group(2), p))
    out.sort(key=lambda t: t[0])
    return out


def _extract_title(path: Path) -> str:
    """Parse the title from the first line, `# ADR-NNNN: <title>`.

    The separator may also be an em-dash (see TITLE_RE). Strips a leading
    BOM and raises ValueError if the first line does not match.
    """
    text = _strip_bom(path.read_text(encoding="utf-8"))
    first_line = text.split("\n", 1)[0] if text else ""
    m = TITLE_RE.match(first_line)
    if not m:
        raise ValueError(
            f"{path.name}: cannot parse title from first line: {first_line!r}"
        )
    return m.group(2)


def _build_index(adr_dir: Path, link_prefix: str) -> str:
    """Build the INDEX.md text for adr_dir.

    link_prefix is the relative href used for ADR links (e.g., ``./``
    so links resolve relative to the INDEX file location).
    """
    adrs = _find_adrs(adr_dir)
    if not adrs:
        raise RuntimeError(f"No ADR files found under {adr_dir}")

    # Validate every ADR is classified.
    missing = sorted(num for num, _slug, _ in adrs if num not in CLASSIFICATION)
    if missing:
        raise RuntimeError(
            "ADR(s) missing from CLASSIFICATION table in "
            "tools/generate_adr_index.py: "
            + ", ".join(f"ADR-{n:04d}" for n in missing)
            + ". Add an entry for each."
        )

    # Map: num → (filename, title)
    num_to_meta: dict[int, tuple[str, str]] = {}
    for num, _slug, path in adrs:
        num_to_meta[num] = (path.name, _extract_title(path))

    # ── Section assembly ────────────────────────────────────────────
    lines: list[str] = []
    lines.append("# ADR Index")
    lines.append("")
    lines.append(
        "Auto-generated by `tools/generate_adr_index.py`. "
        f"Total ADRs: **{len(adrs)}**."
    )
    lines.append("")
    lines.append(
        "Classification mirrors the `/report` skill's section assignment. "
        "When adding a new ADR, also add an entry to the "
        "`CLASSIFICATION` table in `tools/generate_adr_index.py`."
    )
    lines.append("")

    def fmt_entry(num: int) -> str:
        fname, title = num_to_meta[num]
        return f"- [ADR-{num:04d}]({link_prefix}{fname}) — {title}"

    # Design Principles (sorted by ADR number)
    lines.append("## Design Principles")
    lines.append("")
    nums = sorted(n for n, (sec, _) in CLASSIFICATION.items()
                  if sec == DESIGN_PRINCIPLES and n in num_to_meta)
    for n in nums:
        lines.append(fmt_entry(n))
    lines.append("")

    # High-level Architecture (sorted by ADR number, not by declaration
    # order in CLASSIFICATION; the subgroup is appended as an italic note)
    lines.append("## High-level Architecture")
    lines.append("")
    nums = sorted(n for n, (sec, _) in CLASSIFICATION.items()
                  if sec == HIGH_LEVEL and n in num_to_meta)
    for n in nums:
        sub = CLASSIFICATION[n][1] or ""
        fname, title = num_to_meta[n]
        if sub:
            lines.append(
                f"- [ADR-{n:04d}]({link_prefix}{fname}) — {title}"
                f" _({sub})_"
            )
        else:
            lines.append(fmt_entry(n))
    lines.append("")

    # Detailed Architecture (canonical component order)
    lines.append("## Detailed Architecture")
    lines.append("")
    lines.append("One subsection per component file under `src/kernbench/components/builtin/`.")
    lines.append("")
    for comp, adr_nums in DETAILED_COMPONENTS:
        lines.append(f"### {comp}")
        lines.append("")
        if adr_nums:
            for n in adr_nums:
                if n not in num_to_meta:
                    raise RuntimeError(
                        f"DETAILED_COMPONENTS references ADR-{n:04d} for "
                        f"'{comp}' but no such ADR file exists."
                    )
                lines.append(fmt_entry(n))
        else:
            lines.append("_(no ADR coverage)_")
        lines.append("")

    # Implementation Decisions: group by subgroup (topic). Topics are
    # collected in first-appearance order, then re-sorted below.
    lines.append("## Implementation Decisions")
    lines.append("")
    topic_order: list[str] = []
    topic_to_nums: dict[str, list[int]] = {}
    for n, (sec, sub) in CLASSIFICATION.items():
        if sec != IMPL_DECISIONS or n not in num_to_meta:
            continue
        topic = sub or "Uncategorized"
        if topic not in topic_to_nums:
            topic_order.append(topic)
            topic_to_nums[topic] = []
        topic_to_nums[topic].append(n)
    # Final order: by smallest ADR number in each topic, so older infra
    # appears first.
    topic_order.sort(key=lambda t: min(topic_to_nums[t]))
    for topic in topic_order:
        lines.append(f"### {topic}")
        lines.append("")
        for n in sorted(topic_to_nums[topic]):
            lines.append(fmt_entry(n))
        lines.append("")

    return "\n".join(lines).rstrip() + "\n"


def _check_or_write(path: Path, content: str, check: bool) -> bool:
    """Write content to path, or compare in --check mode.

    Returns True only in --check mode when the on-disk file differs from
    content. Returns False after a write, whether or not the file changed.
    """
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    if check:
        if existing != content:
            print(f"[diff] {path} would change.")
            return True
        return False
    path.write_text(content, encoding="utf-8")
    if existing != content:
        print(f"[wrote] {path}")
    else:
        print(f"[unchanged] {path}")
    return False


def main(argv: list[str] | None = None) -> int:
    p = argparse.ArgumentParser(description=__doc__)
    p.add_argument(
        "--root", type=Path, default=Path.cwd(),
        help="Repository root (default: cwd)",
    )
    p.add_argument(
        "--check", action="store_true",
        help="Exit 1 if generated INDEX would differ from disk",
    )
    args = p.parse_args(argv)

    en_dir = args.root / "docs" / "adr"
    ko_dir = args.root / "docs" / "adr-ko"

    if not en_dir.is_dir():
        print(f"error: {en_dir} does not exist", file=sys.stderr)
        return 1

    any_diff = False
    try:
        en_index = _build_index(en_dir, link_prefix="./")
    except (RuntimeError, ValueError) as e:
        print(f"error (EN): {e}", file=sys.stderr)
        return 1
    any_diff |= _check_or_write(en_dir / "INDEX.md", en_index, args.check)

    if ko_dir.is_dir():
        try:
            ko_index = _build_index(ko_dir, link_prefix="./")
        except (RuntimeError, ValueError) as e:
            print(f"error (KO): {e}", file=sys.stderr)
            return 1
        any_diff |= _check_or_write(ko_dir / "INDEX.md", ko_index, args.check)

    if args.check and any_diff:
        print(
            "INDEX.md is out of date. "
            "Run `python tools/generate_adr_index.py` to refresh.",
            file=sys.stderr,
        )
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
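To make the parsing and ordering rules in the script easier to review, here is a small standalone sketch. The two regular expressions are copied from the script; the file names, first lines, and the reduced `classification` mapping are hypothetical samples made up for illustration and do not come from the real ADR corpus.

```python
import re

# Patterns copied from tools/generate_adr_index.py as shown in this diff.
ADR_FILENAME_RE = re.compile(r"^ADR-(\d{4})-([a-z0-9_-]+)\.md$")
TITLE_RE = re.compile(r"^# ADR-(\d{4})\s*[:—]\s*(.+?)\s*$")

# Hypothetical file names and first lines (illustration only).
samples = {
    "ADR-0001-address-scheme.md": "# ADR-0001: Address Scheme",
    "ADR-0033-example.md": "# ADR-0033 — Example Title  ",
    "INDEX.md": "# ADR Index",
}

parsed = {}
for name, first_line in samples.items():
    file_match = ADR_FILENAME_RE.match(name)
    if file_match is None:
        continue  # non-ADR files are skipped, as _find_adrs does
    title_match = TITLE_RE.match(first_line)
    title = title_match.group(2) if title_match else None
    parsed[int(file_match.group(1))] = (file_match.group(2), title)

# Topic ordering: group ADR numbers by topic, then order the topics by the
# smallest ADR number each one contains (hypothetical reduced mapping).
classification = {
    2: "Routing",
    1: "Address Scheme",
    11: "Address Scheme",
    51: "Routing",
}
topic_to_nums = {}
for num, topic in classification.items():
    topic_to_nums.setdefault(topic, []).append(num)
topic_order = sorted(topic_to_nums, key=lambda t: min(topic_to_nums[t]))
```

With these samples both title separators parse to a trimmed title, `INDEX.md` is ignored, and `Address Scheme` is listed before `Routing` because it contains ADR 1, even though `Routing` was declared first.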