9 Commits

Author SHA1 Message Date
mukesh 222815d374 attention: add rank_axis kwarg to mesh kernels for multi_user cube ring
ADR-0059 single_user_* panels run the ring across PEs in one cube
(rank == tl.program_id(axis=0)). multi_user_* panels run the ring
across cubes — rank should be cube_id (axis=1), and 7 of every 8 PEs
in each cube must stay silent because the cube-level SFR install only
gives the cube-coordinate PE 0 an E/W neighbor.

Add ``rank_axis: int = 0`` kwarg to both ``attention_mesh_mlo_kernel``
and ``attention_mesh_kv_kernel``:
  - 0 (default): rank == tl.program_id(axis=0). Existing single_user
    behavior, all spec tests unchanged.
  - 1: gate ``if tl.program_id(axis=0) != 0: return`` at kernel start,
    then ``rank = tl.program_id(axis=1)``. multi_user_* panels pass
    this to the kernel via ctx.launch positional arg.

Also brings in _attention_mesh_kv.py and _attention_mesh_mlo.py as
the committed home of the ADR-0059 kernels (previously living
uncommitted in the working tree from sub-cycle 4b).

Tests: 7-test rank_axis spec file (default-path + rank_axis=1 gating
and cube-id semantics, both kernels); 4-panel diag harness now green
end-to-end (single_user_prefill/decode + multi_user_prefill/decode);
763-test wider sweep clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-01 19:53:18 -07:00
mukesh d9e767d048 runtime_api: ctx.launch honors DPPolicy.num_cubes + adds _auto_dim_remap opt-out
Two compounding bugs in ctx.launch's dim-translation path surfaced
by multi_user_* panels of milestone-gqa-llama70b (sub-cycle 4c step 2):

Bug A: _compute_local_shape divided by self._num_cubes (the topology's
cube count, 16 in default topology.yaml) instead of the DPPolicy's
effective num_cubes (4 for validation-scale multi_user). The tensor
allocator at context.py:471-484 already honored dp.num_cubes; the
parallel computation inside launch was out of sync. Fix mirrors the
allocator's eff_num_cubes precedence pattern.

Bug B: dim_map was keyed by value, so any scalar whose value
coincidentally equaled a global tensor dim got rewritten to that dim's
local value — e.g. d_head=64 colliding with K's global M=64 in
multi_user mode. Legacy bench kernels (va_offset etc.) rely on this
remap, so the fix is opt-out: ctx.launch(..., _auto_dim_remap=False)
preserves scalars exactly as passed. Default remains True.

Tests: 3 new dim-translation tests + 4-panel diag harness covers
single_user_* (PASS) and multi_user_* (advances to new SFR/axis layer
failure, tracked separately). va_offset + full attention spec suite
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-01 19:33:40 -07:00
mukesh 313dee503c sim_engine: fix IPCQ slot-wrap snapshot race in Phase 2 replay
Phase 1 cannot snapshot math-output sources at outbound send time
because math executes only in Phase 2 — so token.data stays None and
PE_DMA inbound can't write the recv slot. For own-sends this is harmless
(Phase 2 replay reads the stable scratch addr after math runs). For
forwarded sends in mesh kernels (ADR-0059), src_addr is a recv slot
that gets wrapped by later inbounds before this read's Phase 2 turn,
yielding a shape mismatch on the fallback MemoryStore.read.

Fix: DataExecutor maintains a per-slot, time-ordered, shape-keyed
history. Every ipcq_copy write appends (t_write, value) to the slot's
history; _resolve_read falls back to the most recent shape-matching
entry with t_write <= the consuming op's t_start. Applied uniformly
to _execute_memory, _execute_gemm, and _execute_math.

Secondary: OpLogger.record_end for math ops now prefers
TensorHandle.data carried by the input handle over a MemoryStore
re-read, closing the smaller record-end race covered by the new
test_op_log_input_snapshot_race.py unit tests.

Tests: 4 new race tests + 6 existing op_log + mesh decode diag +
mesh kv/mlo spec — all green. Full repo sweep: 760 passed (3
pre-existing failures unrelated: bench-registry list drift +
Windows Tkinter env).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-01 19:14:09 -07:00
mukesh b1d6fafd3a eval: commit milestone bench output (track generated figures + results)
Per request, the milestone bench output is now tracked in git instead of
gitignored, so the figures/results are viewable on the remote:

- src/kernbench/benches/1H_milestone_output/gemm/  (3 PNGs + gemm_sweep.json)
- src/kernbench/benches/1H_milestone_output/ccl/   (3 per-topology PNGs,
  buffer-kind PNG+CSV, FSIM comparison PNG, topology.png, summary.csv)

Drop the .gitignore rule; update ADR-0054 D3 + Negative (EN+KO) to say the
output is committed (regenerable by rerunning the bench). Artifacts produced
by full bench runs (milestone-1h-gemm non-FAST, milestone-1h-ccl).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 15:37:27 -07:00
mukesh cc1bbd0ab7 eval: fold GEMM/allreduce harnesses into self-contained milestone benches
Move the GEMM + allreduce sweep/render logic out of scripts/ and tests/
into two self-contained eval benches so a user can regenerate every
result + figure with one command:

  kernbench run --bench milestone-1h-gemm   (MILESTONE_FAST=1 reuses JSON)
  kernbench run --bench milestone-1h-ccl

- benches/milestone_1h_{gemm,ccl}.py: single home for each domain; the
  run(torch) entry drives the sweeps and writes figures into
  benches/1H_milestone_output/{gemm,ccl}/ (gitignored), then submits a
  sentinel tensor to satisfy the run_bench contract.
- tests/gemm + tests/sccl helpers and scripts/gemm_sweep.py become thin
  re-export/wrapper shims over the benches (single source preserved); the
  pytest-only param builders + _run_distributed wrapper stay in the shim.
- eval-bench pattern: a bench may drive many configs + build its own
  per-config engines (extends ADR-0045 D5; reverses ADR-0044 D1/D2).

ADR-0054 (EN+KO) records the design; ADR-0043/0044/0045 + CLAUDE.md CLI
Semantics amended; ADR INDEX regenerated. Verified: milestone benches run
clean (ok=True, all artifacts), full suite 67 passed, lang-pairs OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 15:19:52 -07:00
ywkang e33e76f2d1 adr: add INDEX.md (auto-generated by tools/generate_adr_index.py)
Adds a section-based table of contents for the 46-ADR corpus, mirroring
the /report skill's classification (Design Principles / High-level
Architecture / Detailed Architecture by component / Implementation
Decisions by topic). Generated for both docs/adr/ (EN titles) and
docs/adr-ko/ (KO titles) from one tool.

tools/generate_adr_index.py:
- Single CLASSIFICATION dict per ADR — add an entry when introducing a
  new ADR; the script fails loud if any file is missing from the table.
- DETAILED_COMPONENTS lists each builtin component and the ADR(s) that
  cover it (ADR-0014 appears under six PE engines; ADR-0023 under
  pe_dma + pe_ipcq).
- Accepts both ":" and "—" title separators (matching ADR-0033's
  existing format).
- --check mode for CI: exits 1 if INDEX.md is stale.

Also includes the docs/report/architecture-2026-1H.md generated by the
prior /report write (the public-facing architecture document; 836 lines,
76 source-attribution comments).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 11:15:37 -07:00
ywkang bd49c93703 adr: add ADR-0050-0053 — close /report's second-pass G4 candidates
Documents four cross-cutting surfaces one layer deeper than the prior
G4 batch:

- 0050 par-ccl-algorithm-module-contract: how to author a new CCL
  algorithm in src/kernbench/ccl/algorithms/. Pairs with ADR-0045's
  bench-module contract. Pins the four required public symbols
  (kernel, kernel_args, TOPO_NAME_TO_KIND constants, kernel alias),
  the 9 + tl standardized kernel signature, the kernel_args tuple
  format, sip_topo_kind dispatch, and the ccl.yaml entry workflow.

- 0051 lat-routing-helper-api: every public method of AddressResolver
  (resolve, find_m_cpu, find_pcie_ep, find_io_cpu, find_all_pcie_eps)
  and PathRouter (find_path, find_path_with_distance,
  find_mcpu_dma_path, find_memory_path, find_node_path + 2 shims).
  Pins the four adjacency graphs (_adj_all / _adj / _adj_mcpu_dma /
  _adj_local) and the edge-kind exclusion sets they use, plus the
  single-owner naming convention.

- 0052 dev-oplog-memory-store-schemas: OpRecord's 7 fields, the
  per-op_name params matrix (dma_read, dma_write, gemm_*, math, math
  reduction, composite_gemm, ipcq_copy, unknown), snapshot timing
  rules (math = all inputs, dma_write = HBM-only — ADR-0027 race
  avoidance), TileToken stage_type capture, and MemoryStore's
  (space, addr) two-level dict with reference-store semantics.

- 0053 dev-topology-builder-algorithms: the 6-stage compile pipeline,
  cube_mesh.yaml's source_hash cache and its 5 input fields, the
  cube NoC auto-layout algorithm (row/col placement, HBM exclusion
  zone, PE/M_CPU/SRAM attachment via nearest-router, UCIe N/S/E/W
  distribution), the node naming convention (single-owner with
  router.py), the edge-kind catalog, the 4 view projections, and a
  table of spec-field changes vs mesh regeneration.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:52:42 -07:00
ywkang 9a02955770 adr: add ADR-0046-0049 — close G4 coverage gaps from /report
Documents four cross-cutting surfaces that previously had no ADR backing,
each surfaced as a G4 candidate by /report:

- 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates
  all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...),
  the two execution modes (command-list vs greenlet runner), scratch
  allocator semantics, dispatch-overhead model, and the kernel registry.

- 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group
  (backend="ahbm") install path. world_size priority (algorithm >
  defaults > topology), the 4-step init sequence (load ccl.yaml, import
  algorithm module, derive world_size, install SFR + IPCQ), greenlet-
  local rank registry, all_reduce dispatch via _defer_wait, barrier
  no-op rationale, and the explicit list of unsupported dist.* APIs.

- 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator
  free-list semantics. Offset-keyed first-fit with coalescing, the
  no-validation trust model for free(), HBM/TCM channel separation,
  page-aligned VA allocation, the page_size dual-default
  (VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and
  one-allocator-per-sub-unit rule.

- 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog.
  H2D / D2H / PE DMA categories with their exact cube-index choices,
  the 32 KiB reference size, the 5-point utilization sweep, the
  formula vs actual column meanings, automatic invariant checks
  (monotonicity, D2H >= H2D, best < worst), per-case GraphEngine
  isolation, and the human-readable (not machine-parsable) output
  contract.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:25:04 -07:00
ywkang 5f8dd688f5 adr: add ADR-0045 (bench module contract — registration, dispatch, authoring)
Documents src/kernbench/benches/: how @bench registration + audit work,
how the CLI dispatches via run_bench/RuntimeContext, and the contract a
new bench module must satisfy.

Nine decisions (D1-D9) cover:
- @bench name/description rules and duplicate detection
- Module-file convention (_-prefixed helpers vs bench modules)
- def run(torch) signature; torch = RuntimeContext
- Minimum-one-submit rule (else NO_REQUESTS)
- Single-device convention + multi-SIP CCL exception (ADR-0024/0027)
- resolve() name/index decision tree; indices are not a stable API
- Exact RuntimeContext surface exposed to benches
- Env-var parameterization (matmul_composite / gemm_sweep.py pattern)

Four alternatives rejected with documented reasons (manifest YAML,
decorator entry= arg, @multi_device_bench split, stable indices).

Verifier (tools/verify_adr_lang_pairs.py) passes for EN/KO pair.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 16:29:45 -07:00
56 changed files with 11807 additions and 1484 deletions
@@ -371,6 +371,13 @@ Concrete forms that Part 1's *Verification Plan* MUST take in this repo:
- `kernbench run --device <id>` runs the benchmark on a single device.
- Omitting `--device` runs the benchmark on all devices discovered in the topology (logically parallel).
- Device enumeration is handled by the CLI only; benchmarks MUST remain single-device.
- **Eval-bench exception (ADR-0054)**: a *milestone / eval bench*
(`milestone-1h-*`) may drive many configurations and build its own
per-config engines to regenerate a domain's full result + figure set; it
ignores `--device` and submits a sentinel tensor to satisfy the
"must submit ≥1 request" contract (ADR-0045 D4). This is the eval-harness
carve-out to the single-device rule, alongside the ADR-0024 multi-SIP CCL
exception.
## Derived Artifacts (Clarification)
@@ -7,6 +7,11 @@ Accepted
Documents the `tests/sccl/` evaluation harness; verified against the
implementation (constants, file set, and sweep dimensions cross-checked).

**Amended by ADR-0054**: the driver core, sweep, and renderer moved to the
`milestone-1h-ccl` bench (single home); `tests/sccl/_allreduce_helpers.py` now
re-exports from there (the pytest-only param builders and the
`_run_distributed` wrapper stay local). Figure tests are unchanged.

## Context
ADR-0032 defines the intercube all-reduce *algorithm*, and ADR-0023/0024/0027
@@ -8,6 +8,12 @@ GEMM 평가/특성화 하니스를 문서화한다; 구현과 대조 검증 완
(constants, tile sizes, figure set, and the script↔test split cross-checked).
The D5/D6 caveats are recorded limitations, not inaccuracies.

**Amended by ADR-0054**: the sweep and renderer moved to the
`milestone-1h-gemm` bench (single home); `scripts/gemm_sweep.py` and
`tests/gemm/` now re-export from there. The D1/D2 rule that "data generation
is a manual script / heavy work is opt-in" is replaced by the eval-bench
pattern (one bench regenerates everything; `MILESTONE_FAST=1` reuses the
committed JSON).

## Context
For the GEMM *implementation*, ADR-0014 (PE pipeline) and ADR-0042 (tile-plan generator)
@@ -0,0 +1,265 @@
# ADR-0045: Bench Module Contract — registration, dispatch, and authoring
## Status
Accepted (2026-05-21).
Defines, in one place, the registration mechanism (@bench) of the
`src/kernbench/benches/` package, the CLI dispatch path (`kernbench run/list`),
and the contract a new bench module must follow. ADR-0010 (CLI surface)
specifies the `kernbench list/run` interface, but **how a bench is registered
and which function signature it must follow** had no ADR-level coverage.

**Extended by ADR-0054**: a third pattern is added to D5's
single-configuration rule. An *eval bench* (e.g. `milestone-1h-*`) drives
multiple configurations, builds its own per-config engines, and submits a
sentinel tensor to satisfy D4.

## First action
When the `kernbench.benches` package is imported, `__init__.py` immediately
calls `_eager_import_and_audit(__path__, __name__)`. The function first lists
every sibling module in the package directory with
`pkgutil.iter_modules(__path__)`, then **eagerly loads**, with
`importlib.import_module(...)`, every module that meets neither of the
following two conditions:
- its name is `registry` (the infrastructure itself)
- its name starts with `_` (a helper module)

At import time, the `@bench(name=..., description=...)` decorator in each
module runs, appending a `(name, description, fn)` tuple to the `_PENDING`
list and adding `fn.__module__` to the `_REGISTERED_MODULES` set.

Once all imports finish, `_audit_modules(imported, _REGISTERED_MODULES)` is
called; if any module **was imported but never called @bench**,
`RuntimeError("Bench module(s) missing @bench decorator: ...")` is raised
immediately.

When this audit passes, no indices have been assigned yet; the first
`list_all()` / `resolve(...)` call triggers `_finalize()`, which assigns
1-based indices in alphabetical order by name.

In short, **the first job of the bench infrastructure is to import every
non-helper module in the package directory and audit that each one called
@bench at least once**.
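
The import filter and the audit described above can be sketched as follows. This is a minimal illustration, not the repository's code: `should_import` and `audit_modules` are hypothetical stand-ins for the logic inside `_eager_import_and_audit` and `_audit_modules`, whose real signatures and message format may differ.

```python
def should_import(module_name):
    """Skip the registry infrastructure and `_`-prefixed helper modules."""
    return module_name != "registry" and not module_name.startswith("_")


def audit_modules(imported, registered):
    """Raise if any imported bench module never called @bench.

    Hypothetical stand-in for `_audit_modules`; both arguments are
    iterables of module names.
    """
    missing = sorted(set(imported) - set(registered))
    if missing:
        raise RuntimeError(
            "Bench module(s) missing @bench decorator: " + ", ".join(missing)
        )
```

Taking the set difference keeps the audit independent of import order, which matters because `pkgutil.iter_modules` makes no promise about ordering that a bench author should rely on.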
## Context
`src/kernbench/benches/` currently holds 8 bench modules (`ccl_allreduce`,
`gemm_single_pe`, `gpt3_qkv`, `ipcq_allreduce`, `matmul_composite`, `qkv_gemm`,
`qkv_gemm_multi_pe`, `va_offset_verify`). Every bench follows this unified flow:
```
kernbench run --topology <T> --bench <N>
cli/main.py::cmd_run
↓ resolve_topology(T) + resolve(N) + resolve_device(device_arg)
runtime_api/bench_runner.py::run_bench(topology, bench_fn, device, engine_factory)
↓ engine_factory(topology, device) → GraphEngine
↓ RuntimeContext(engine, target_device, correlation_id, spec)
bench_fn(ctx) ← the run(torch) defined by the bench is called
↓ submits via ctx.empty/zeros/from_numpy/launch/distributed.* etc.
ctx.wait_all() ← drains any incomplete handles
BenchResult(completion, correlation_id, trace, traces, engine)
```
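ADR-0010 covers only the CLI surface (`run/list/probe/web`), and ADR-0007
covers only the runtime API ↔ sim_engine responsibility boundary. The actual
question of "what shape must a new bench file take" can only be traced through
code conventions. As a result:
- The calling convention of the @bench decorator (kebab-case name, non-empty
  description) exists only in code.
- The bench function signature (`def run(torch)`) is effectively a convention,
  enforced only by the fact that CLI dispatch calls `spec.run`.
- New bench authors learn that "helper modules must be split off with a `_`
  prefix" only after hitting the audit RuntimeError.
- It is unclear how far the single-device convention (CLAUDE.md Part 2 CLI
  Semantics) applies from a bench author's perspective (is a multi-SIP CCL
  bench an exception?).

This ADR consolidates these ambiguities in one place.

## Decision
### D1. @bench decorator contract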
```python
from kernbench.benches.registry import bench


@bench(name="my-bench", description="Short, complete-sentence description.")
def run(torch):
    ...
```
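- `name`: a kebab-case string that must match the regex
  `^[a-z][a-z0-9]*(-[a-z0-9]+)*$`. Only lowercase letters, digits, and dashes
  are allowed; underscores (`_`) are forbidden; it must start with a letter.
- `description`: a non-empty string (length > 0 after strip). Shown verbatim
  in CLI `list` output.
- The decorator **returns fn unmodified**, so the function can still be called
  directly. Its only side effect is adding the entry to `_PENDING`.

Violating either of the first two rules raises `ValueError` immediately.
Duplicate names are caught at `_finalize()` time with
`RuntimeError("duplicate bench name: ...")`.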
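
A minimal sketch of the D1 argument checks, using the regex quoted above. `validate_bench_args` is a hypothetical helper name and the error messages are assumptions; only the regex and the non-empty-after-strip rule come from the text.

```python
import re

# Regex quoted from D1.
_NAME_RE = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")


def validate_bench_args(name, description):
    """Raise ValueError when name or description breaks the D1 rules."""
    # fullmatch also rejects a trailing newline, which `$` alone would allow.
    if not _NAME_RE.fullmatch(name):
        raise ValueError(f"bench name must be kebab-case: {name!r}")
    if not description.strip():
        raise ValueError("bench description must be non-empty")
```

For instance, `milestone-1h-gemm` passes because a segment after a dash may start with a digit, while `my_bench` and `2fast` are rejected.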
### D2. Module file convention
`src/kernbench/benches/<slug>.py` must be one of the following:
- **bench module**: `@bench(...)` runs at least once on the top-level import
  path, registering at least one bench.
- **helper module**: the filename starts with `_` (e.g. `_shared_helpers.py`).
  It is skipped during the `iter_modules` walk.

The audit (`_audit_modules`) rejects any non-helper module that never calls
@bench. The intended result: simply adding a new file to `benches/` registers
it automatically, and the **filename prefix** alone marks the distinction from
helpers.

### D3. The bench function signature is `def run(torch)`
The decorator does not enforce the function name, but **CLI dispatch calls
`spec_entry.run` (i.e. the decorated callable)**. The convention is therefore:
- Function name: `run`. Decorating a differently named function still works,
  but always use `run` for readability and grep-ability.
- Argument: a single positional argument `torch`. It is actually a
  `RuntimeContext` instance that exposes a PyTorch-style namespace
  (zeros/empty/launch/distributed/...) (ADR-0024 D3).
- Return value: arbitrary (`Any`). `run_bench` currently ignores the return
  value and tracks completion via `ctx.handles()` / `engine.get_completion()`.

The name `torch` mimics the PyTorch-compatible idiom; no actual PyTorch module
is passed in (aligned with ADR-0024's "rank = SIP" launcher convention).

### D4. A bench must perform at least one submit
When `ctx.handles()` is empty, `run_bench` returns BenchResult.completion as
`ok=False, error_code="NO_REQUESTS"`. A meaningful bench must therefore call
at least one of the following:
- Tensor creation API: `torch.zeros(...)`, `torch.empty(...)`. Internally this
  submits `MmuMapMsg` and (for zeros) `MemoryWriteMsg`.
- Kernel launch API: `torch.launch(name, fn, *args)` submits a
  `KernelLaunchMsg` per SIP.
- (Exception) empty placeholder bench: a stub that only calls `print(...)`,
  such as `ipcq_allreduce.py`, gets a NO_REQUESTS result. CI is assumed to
  recognize it as a placeholder and handle it separately.
### D5. Single-device convention + multi-SIP exception (ADR-0024/0027)
The **"benchmarks MUST remain single-device"** convention stated in CLAUDE.md
Part 2 CLI Semantics is interpreted as follows:
- **Regular bench (single SIP)**: define tensor placement with
  `dp = DPPolicy(...)` and launch kernels with `torch.launch(...)`. `--device`
  determines the SIP index (a CLI responsibility).
- **CCL bench (multi-SIP)**: as an exception, follows the rank = SIP pattern
  (ADR-0024 D3) via `torch.distributed.init_process_group(backend="ahbm")` +
  `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)`. `--device` is ignored
  (or assumed to be `all`), and each spawned worker binds its own SIP with
  `torch.ahbm.set_device(rank)`.

This ADR forbids any multi-device call outside these two patterns (e.g. a
bench function launching directly on several SIPs from one process). Even when
the CLI is invoked with `--device all`, the bench runs only once; to handle
multiple SIPs inside it, use D5's second pattern.

### D6. Name/index resolution (`resolve`)
`resolve(identifier: str)` returns a BenchSpec in this order:
1. `identifier.isdigit()` → convert to an integer and return the spec among
   the `_REGISTRY` entries whose `index` matches. If there is none,
   `ValueError("No bench with index ...")`.
2. `identifier in _REGISTRY` → direct lookup.
3. Otherwise → `ValueError("Unknown bench ...")`.

An empty or whitespace-only identifier raises
`ValueError("bench identifier must be a non-empty string.")`.
The CLI passes the `--bench` argument to `resolve` unchanged, so users can
write either `kernbench run --bench gemm-single-pe` or
`kernbench run --bench 2`.
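
The D6 decision order can be sketched as below. This is an illustration under stated assumptions: the real `resolve` takes only `identifier` and reads the module-level `_REGISTRY`, whereas this sketch takes the registry as an explicit argument so it is self-contained, and the `BenchSpec` fields shown are a guess at the minimum needed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchSpec:
    # Illustrative fields only; the real BenchSpec may carry more.
    index: int
    name: str


def resolve(identifier, registry):
    """Resolve a --bench argument to a BenchSpec by index, then by name."""
    if not identifier or not identifier.strip():
        raise ValueError("bench identifier must be a non-empty string.")
    if identifier.isdigit():
        wanted = int(identifier)
        for spec in registry.values():
            if spec.index == wanted:
                return spec
        raise ValueError(f"No bench with index {wanted}")
    if identifier in registry:
        return registry[identifier]
    raise ValueError(f"Unknown bench {identifier!r}")
```

Checking `isdigit()` first means a purely numeric string is always treated as an index, which is safe because D1 requires bench names to start with a letter.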
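### D7. Indices are not a stable API
`_finalize()` sorts `_PENDING` **alphabetically by name** and then assigns
1-based indices, so adding a new bench can shift the indices of existing ones.
Therefore:
- Human-friendly interactive use: indices are fine.
- Scripts / CI automation: always use names.

This is stated in the `registry.py` module docstring.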
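
To make the index shift concrete, here is a small sketch of alphabetical 1-based assignment. `assign_indices` is a hypothetical helper, and the bench names other than `gemm-single-pe` are assumed kebab-case forms of the module names listed in Context.

```python
def assign_indices(names):
    """Return {name: 1-based index} after sorting names alphabetically."""
    return {name: i for i, name in enumerate(sorted(names), start=1)}


before = assign_indices(["gemm-single-pe", "qkv-gemm"])
# Adding a bench that sorts first pushes every existing index up by one.
after = assign_indices(["gemm-single-pe", "qkv-gemm", "ccl-allreduce"])
```

Here `gemm-single-pe` moves from index 1 to index 2 once `ccl-allreduce` is registered, which is why scripts should pass names.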
### D8. The surface RuntimeContext exposes to benches
The surface a bench function may legitimately use through the `torch`
parameter:
- **Tensor creation**: `torch.empty(shape, dtype=..., dp=DPPolicy(...), name=...)`,
  `torch.zeros(...)`, `torch.from_numpy(arr)`. All submit host-side metadata +
  device deployment (MmuMap + MemoryWrite).
- **Kernel launch**: `torch.launch(kernel_name, kernel_fn, *args)` converts
  `(Tensor, int, float)` positional arguments into `TensorArg` / `ScalarArg`,
  issues a `KernelLaunchMsg` per SIP, then drains.
- **Synchronization**: `torch.wait(handle)`, `torch.wait_all()` (called
  automatically by run_bench).
- **Distributed**: `torch.distributed.init_process_group(backend="ahbm")`,
  `torch.distributed.get_world_size()`, `torch.distributed.all_reduce(t, op=...)`
  (ADR-0024/0027).
- **Multi-process (rank=SIP)**: `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)`
  (ADR-0024 D3 / ADR-0027).
- **Device binding**: `torch.ahbm.set_device(rank)` or
  `torch.accelerator.set_device_index(rank)` (both point to the same namespace).
- **IPCQ install**: `torch.install_ipcq(algorithm=..., ccl_yaml=...)` (ADR-0023 D10).
- **Spec lookup**: `torch.spec`, the dict built by the topology builder
  (system, cube_mesh, HBM parameters, etc.). It exists so that a bench can be
  written without depending on topology.yaml parameters.

A bench must not directly access private RuntimeContext members that are not
listed above (`_handles`, `_traces`, `_allocators`, etc.). This aligns with the
spirit of ADR-0007's layer boundary: only the one direction
bench → runtime API → sim_engine is allowed.

### D9. Parameterization via environment variables is allowed
Injecting bench parameters from outside via `os.environ.get("MATMUL_M", ...)`
and similar calls, as `matmul_composite.py` does, is allowed. Reasons:
- D3 fixes the bench function signature to `def run(torch)`, so taking
  parameters as positional/keyword arguments is impractical.
- The environment-variable pattern is a natural hook for run-time sweeps such
  as `MATMUL_VARIANT`.
- External drivers such as `scripts/gemm_sweep.py` (ADR-0044) use this hook.

However, if environment variables change a bench's behavior, list every
variable in the module docstring (matmul_composite.py is the example).
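
The environment-variable hook in D9 might look like the sketch below. `read_int_env` is a hypothetical helper, and the default of 128 is an assumption for illustration; the real defaults in `matmul_composite.py` are not stated here.

```python
import os


def read_int_env(name, default):
    """Read an integer bench parameter from the environment."""
    raw = os.environ.get(name)
    return default if raw is None else int(raw)


# Hypothetical default; the real default for MATMUL_M is not given in this ADR.
M = read_int_env("MATMUL_M", 128)
```

Parsing with `int(...)` fails loudly on a malformed value, which is usually preferable in a sweep to silently falling back to the default.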
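## Alternatives Considered
### A1. Keep the bench list in an explicit manifest file (YAML)
Rejected. The @bench decorator + audit pattern guarantees "add a file =
automatic registration", concentrating a new bench author's cognitive cost in
one place (writing the file). A separate manifest carries a high drift risk
for maintainers, and helper separation is already clear from the `_` prefix.

### A2. Take the bench function name as a decorator argument (`@bench(name=..., entry="run_xxx")`)
Rejected. It breaks the simplicity of dispatch only ever calling `spec.run`.
The single `run` convention suffices; if variants are needed, register several
functions in the same module (each decorated with @bench).

### A3. A separate `@multi_device_bench` decorator for CCL benches
Rejected. The two patterns stated in D5 (single + ADR-0024 multi-SIP) express
all 8 current benches. A separate decorator would force a branch in dispatch
and add complexity, and the intent to use multiple SIPs is already evident
from the `init_process_group(...)` call in the bench body.

### A4. Make indices a stable API (registration order / explicit index= argument)
Rejected. This is the trade-off stated in D7: user friendliness comes first
(alphabetically sorted indices read naturally as 1, 2, 3... in list output).
Scripts can simply use names.

## Consequences
- "How to add a bench" is consolidated in one ADR, so new authors only need to
  follow D1-D3 and D8 without grepping code.
- The pattern of splitting helper modules with a `_` prefix is justified at
  the ADR level, which frees future authors to write shared helpers of the
  form `benches/_*.py`.
- D5 makes explicit that the single-device convention in CLAUDE.md Part 2 CLI
  Semantics does not conflict with multi-SIP CCL benches; the two are
  orthogonal.
- The rationale (D9) for `scripts/gemm_sweep.py` in ADR-0044 (GEMM eval
  harness) using the environment-variable hook is fixed in this ADR.
- Because index instability (D7) is now explicit, CI code such as
  `kernbench run --bench 3` should be reviewed right after this ADR is
  accepted.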
@@ -0,0 +1,307 @@
# ADR-0046: TLContext — Kernel-side `tl.*` API Contract
## Status
Accepted (2026-05-22).
Specifies the set of `tl.*` primitives that `TLContext` in
`src/kernbench/triton_emu/` exposes and their meaning, plus the contract of the
two execution modes (command-list / greenlet runner). ADR-0014/0020 define the
PE pipeline and the 2-pass execution model, but **the `tl.*` surface that a
bench's kernel function calls** had not been documented at the ADR level.

## First action
When `TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id, num_cubes,
scratch_base, scratch_size)` is constructed, it first initializes the following
six groups of fields:
- `self._pe_id`, `self._num_programs`, `self._cube_id`, `self._num_cubes`:
  the values that `tl.program_id` / `tl.num_programs` return.
- `self._dispatch_cycles`: the cycle count of the `PeCpuOverheadCmd(cycles)`
  issued automatically at the start of every `tl.*` API call.
- `self._runner`: the `KernelRunner` instance (greenlet mode if present,
  command-list mode otherwise).
- `self._commands: list[PeCommand] = []`: the command sequence accumulated in
  command-list mode.
- `self._handle_counter = 0`, `self._completion_counter = 0`: used to generate
  new TensorHandle / CompletionHandle ids.
- `self._scratch_base`, `self._scratch_size`, `self._scratch_cursor = 0`:
  the PE-local scratch region (used to assign output handle addresses for
  math/dot/composite).

In short, **the first job of TLContext is to fill in the metadata describing
where (sip/cube/pe) and at what scale (num_programs/num_cubes) this kernel
instance runs, and in which mode (runner present or not) it issues commands**.
At this point there are no SimPy events and no command has been issued.

The first runtime action occurs when the kernel function first calls
`tl.<api>()`. The standard entry behavior of every `tl.*` API is:
1. Call `self._emit_dispatch_overhead()`: if `dispatch_cycles > 0`,
   immediately `_emit` a `PeCpuOverheadCmd(dispatch_cycles)`.
2. API-specific processing (create the TensorHandle, build the command).
3. `self._emit(cmd)`: in runner mode, pass cmd to the SimPy side via
   greenlet.switch; otherwise append it to `self._commands`.
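
The three entry steps above can be sketched for command-list mode as follows. `MiniTL` and `api_call` are hypothetical names for illustration and are not the real `TLContext`; the greenlet branch of `_emit` is omitted, and the `PeCpuOverheadCmd` here is a local stand-in dataclass.

```python
from dataclasses import dataclass, field


@dataclass
class PeCpuOverheadCmd:
    cycles: int


@dataclass
class MiniTL:
    """Toy command-list-mode context; not the real TLContext."""

    dispatch_cycles: int = 1
    commands: list = field(default_factory=list)

    def _emit(self, cmd):
        # Command-list mode: record the command instead of switching greenlets.
        self.commands.append(cmd)

    def _emit_dispatch_overhead(self):
        if self.dispatch_cycles > 0:
            self._emit(PeCpuOverheadCmd(self.dispatch_cycles))

    def api_call(self, cmd):
        # Step 1: dispatch overhead. Step 2 (building cmd) is done by the
        # caller in this sketch. Step 3: emit the command itself.
        self._emit_dispatch_overhead()
        self._emit(cmd)
```

With `dispatch_cycles=0` the overhead command is skipped entirely, so the recorded sequence contains only the API's own command.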
## Context
The `tl.*` surface consists of the methods `TLContext` exposes, and the `tl`
parameter a kernel function receives is this object. The contract as seen by a
user (bench author):
- which primitives exist
- what data flow each primitive triggers (DMA / compute / IPCQ /
  metadata-only)
- how a TensorHandle's `space` and `addr` are determined
- the difference between command-list mode and greenlet mode

ADR-0014 (PE pipeline) defines the PeCommands that PE_SCHEDULER receives, but
how `tl.*` emits them exists only in code conventions. Likewise, ADR-0020
(2-pass data execution) mentions the existence of greenlet mode in D3, but the
signature difference between the runner / non-runner paths (return value
handling) is not stated at the ADR level. This ADR fills that gap.

## Decision
### D1. The `tl` parameter is a `TLContext` instance
A bench's kernel function follows this signature:
```python
def _kernel(arg1, arg2, ..., tl, **kwargs):
    ...
```
`tl` is a `kernbench.triton_emu.tl_context.TLContext` instance. The name
mimics real Triton's `triton.language` module; the real Triton module is not
passed in.

A kernel function is a plain Python function with no `yield` / `async`. `tl.*`
calls trigger SimPy events, but from the caller's (kernel's) side they look
like synchronous calls, because in greenlet mode KernelRunner relays between
SimPy and the kernel (ADR-0020 D3).

### D2. Two execution modes: command-list / greenlet runner
- **command-list mode (`runner is None`)**: `tl.*` calls accumulate PeCommands
  in the `self._commands` list. DMA / GEMM / Math consume no actual SimPy time,
  and the return value is a metadata-only TensorHandle (data=None).
  PE_SCHEDULER / sim_engine later replays the command sequence in time.
- **greenlet runner mode (`runner is not None`)**: `tl.*` calls go through
  `self._emit(cmd)` to `runner.switch_to_simpy(cmd)`, handing control to the
  parent greenlet (SimPy). The parent distributes cmd to the components,
  consumes SimPy time, and, for a DMA read, returns the actual numpy data. The
  kernel receives the result and proceeds to the next line (the data-aware
  execution model of ADR-0020 D3).

Mode selection is determined by whether a KernelRunner instance is injected
into the TLContext; the `tl.*` methods are unaware of the difference and
behave consistently through the `_emit()` helper.
### D3. Primitive categories
#### D3.1. Reference (no DMA, metadata only)
- `tl.ref(ptr, shape, dtype="f16") -> TensorHandle`: creates only a handle
  that references HBM data; no DMA is issued. Used when the composite
  scheduler streams per tile (e.g. GEMM's b operand).

#### D3.2. Data movement (blocking, DMA engine)
- `tl.load(ptr, shape, dtype="f16") -> TensorHandle`: HBM → result handle.
  Issues `DmaReadCmd`. In greenlet mode the actual numpy array is attached to
  the result handle's `.data`; in command-list mode it is a placeholder. The
  returned handle has `space="hbm"`, `pinned=True`.
- `tl.store(ptr, handle) -> None`: TCM → HBM. Issues `DmaWriteCmd`. In greenlet
  mode, `_store.write("hbm", ptr, data)` is called first, and only when
  `handle.data` is present (visibility = issue time, ADR-0020 D3).

#### D3.3. GEMM / compute (blocking)
- `tl.dot(a, b) -> TensorHandle`: `a @ b`. Both operands must be in TCM, with
  shapes (M,K) × (K,N) → (M,N). Issues `GemmCmd`; the output handle is
  allocated in PE-local scratch by `_make_compute_out(shape, dtype)`.
- `tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None,
  acc_dtype=None, tile_shape=None) -> CompletionHandle`: non-blocking tiled
  pipeline. Issues `CompositeCmd`. `epilogue` is a list of dicts, each with an
  `"op"` key + op-specific fields + an optional `"scope"` (k_tile /
  output_tile); an unknown op or a missing field raises ValueError
  immediately. Synchronize on the returned CompletionHandle with `tl.wait(h)`.

#### D3.4. Math: unary (blocking)
- `tl.exp(x)`, `tl.log(x)`, `tl.sqrt(x)`, `tl.abs(x)`, `tl.sigmoid(x)`,
  `tl.cos(x)`, `tl.sin(x)`: all issue `MathCmd(op=<name>, inputs=(x,), out=)`.
  `out` is a scratch allocation of the same shape/dtype.

#### D3.5. Math: binary (blocking)
- `tl.maximum(a, b)`, `tl.minimum(a, b)`: `_binary_math`.
- `tl.fma(a, b, c)`: `a*b + c`. Three inputs.
- `tl.clamp(x, min, max)`: `MathCmd(op="clamp", inputs=(x, min, max))`.
- `tl.where(cond, a, b)`: `MathCmd(op="where", inputs=(cond, a, b))`.
- `tl.softmax(x, axis=-1)`: a single MathCmd(op="softmax"), so time is
  accounted for in one step. The Phase 2 DataExecutor expands it into the
  canonical form (x-max → exp → sum → div).
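
The canonical softmax expansion named above (x-max → exp → sum → div) can be written out for a single row as below. This is a plain-Python sketch of the arithmetic only; it does not reproduce the DataExecutor, which operates on arrays along a chosen axis.

```python
import math


def softmax_canonical(row):
    """Softmax over one row via x-max -> exp -> sum -> div."""
    row_max = max(row)
    shifted = [v - row_max for v in row]   # x - max (keeps exp from overflowing)
    exps = [math.exp(v) for v in shifted]  # exp
    total = sum(exps)                      # sum
    return [e / total for e in exps]       # div
```

Subtracting the row maximum first does not change the result mathematically, but it keeps every exponent at or below zero, so large inputs such as 1000.0 do not overflow.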
#### D3.6. Reduction (blocking)
- `tl.sum(x, axis)`, `tl.max(x, axis)`, `tl.min(x, axis)`: return an output
  handle with the given axis reduced to size 1. Issue
  `MathCmd(op=<name>, inputs=(x,), out=, axis=axis)`.

#### D3.7. Index / scalar (PE_CPU, no engine)
- `tl.program_id(axis=0) -> int`: `axis==0` → pe_id (cube-local PE index),
  `axis==1` → cube_id (ADR-0022).
- `tl.num_programs(axis=0) -> int`: `axis==0` → num_programs (PEs per cube),
  `axis==1` → num_cubes.
- `tl.arange(start, end, dtype="i32") -> TensorHandle`: an index range in TCM.
  Metadata only; no command is issued.
- `tl.zeros(shape, dtype="f16") -> TensorHandle`, `tl.full(shape, value,
  dtype="f16") -> TensorHandle`: placeholders in TCM. No command is issued.

#### D3.8. Scalar helpers (no command, no engine)
- `TLContext.cdiv(a, b) -> int` (static): ceiling division
  `-(-a // b)`. Mimics real Triton's `tl.cdiv`.
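
As a quick illustration of the `-(-a // b)` form, here is a standalone sketch. The free function `cdiv` stands in for the static method, and the tile sizes are illustrative values, not taken from the repository.

```python
def cdiv(a, b):
    """Ceiling division using the `-(-a // b)` form quoted in D3.8."""
    return -(-a // b)


# Number of 64-wide tiles needed to cover 130 columns (illustrative sizes).
num_tiles = cdiv(130, 64)
```

The trick works because Python's `//` floors toward negative infinity, so flooring the negated numerator and negating again rounds up without any floating-point arithmetic.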
#### D3.9. Metadata-only (no compute, no DMA)
- `tl.trans(x) -> TensorHandle`: a new handle with the last two dims of the
  shape swapped. It shares the same addr/data; no command is issued.

#### D3.10. IPCQ (CCL) primitives (ADR-0023 D4)
- `tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None,
  dtype="f16", space="tcm") -> None`: blocking send. Accepts either the handle
  form or the raw-address form. Issues `IpcqSendCmd`. When a snapshot of the
  handle's `.data` is carried in the command, it avoids a race on the data
  received on the recv side.
- `tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None,
  dst_space=None) -> TensorHandle`: blocking recv. If both
  `dst_addr/dst_space` are given, "copy_to_dst" mode; otherwise "return_slot"
  mode. In greenlet mode the actual data is attached to the handle's `.data`.
- `tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle`:
  **DIAGNOSTIC ONLY**. The recv blocking synchronization still applies, but
  the slot-read latency (slot-IO + PE↔bank fabric drain) is skipped. Intended
  for an apples-to-apples comparison with `tl.store` in the pe2pe overview
  plot. Forbidden in production kernels; it is isolated in a separate command
  branch, `consume=False`, so that an accidentally set flag cannot take effect.
- `tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture`: non-blocking
  recv. Returns a `RecvFuture`; collect the result later with
  `tl.wait(future)`.

#### D3.11. Composite + control
- `tl.composite(...)`: described in D3.3.
- `tl.wait(handle=None)`: waits on a `CompletionHandle` (composite), a
  `RecvFuture` (async recv), or `None` (all pending composites).
- `tl.cycles(n)`: explicitly declares PE_CPU scalar execution overhead. Issues
  `PeCpuOverheadCmd(cycles=n)`.

### D4. TensorHandle arithmetic operators: thread-local TLContext
When the `tl_context.py` module loads, `_enable_tensor_ops()` is called and
patches `TensorHandle.__add__`, `__sub__`, `__mul__`, and `__truediv__`. Each
operator calls `_binary_math` on the active TLContext stored in the
thread-local `_ctx` (a module variable).

So inside a kernel, `c = a + b` is equivalent to issuing
`MathCmd(op="add", inputs=(a,b), out=)` and returning a new TensorHandle.

Active TLContext management:
- `TLContext._set_active(ctx)`: sets the active ctx for the current
  thread/greenlet.
- `TLContext._get_active()`: lookup (RuntimeError if none).
- `run_kernel(kernel_fn, tl_ctx, *args, **kwargs)`: helper. Sets active on
  entry, runs the kernel, and restores None on exit.

On every cmd dispatch, `KernelRunner`'s `_switch_kernel` calls
`_set_active(tl)` directly, so that the active ctx is restored even when
another PE runner in the same thread has overwritten it.
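
The operator-dispatch pattern in D4 can be sketched with a toy handle and context. All names here (`Ctx`, `Handle`, `_local`) are hypothetical, and the sketch defines `__add__` directly on the class rather than patching it at module load as `_enable_tensor_ops()` is described as doing.

```python
import threading

_local = threading.local()


class Ctx:
    """Toy context that records binary math ops; not the real TLContext."""

    def __init__(self):
        self.ops = []

    def binary_math(self, op, a, b):
        self.ops.append((op, a.name, b.name))
        return Handle(f"({a.name} {op} {b.name})")


class Handle:
    def __init__(self, name):
        self.name = name

    def __add__(self, other):
        # Look up whichever context is active for this thread.
        ctx = getattr(_local, "ctx", None)
        if ctx is None:
            raise RuntimeError("no active context")
        return ctx.binary_math("add", self, other)


def run_kernel(kernel_fn, ctx, *args):
    """Set the active context, run the kernel, then restore None."""
    _local.ctx = ctx
    try:
        return kernel_fn(*args)
    finally:
        _local.ctx = None
```

Because greenlets in one thread share the same thread-local storage, a scheme like this needs the active context re-set on every switch, which matches the `_switch_kernel` behavior described above.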
### D5. Scratch allocator — compute output handles
Ops that produce a result, such as `tl.dot`, `tl.exp`, and `tl.add`
(TensorHandle `__add__`), call `_make_compute_out(shape, dtype)` to allocate
a 16-byte-aligned scratch address. The address is emitted with
`space="tcm"`, and a later `tl.send` / `tl.store` can use the handle as its
source.
If `_scratch_base == 0` (command-list mode, for example), the allocated
address is returned as 0 and the handle cannot be used as a send/store
source (in that case only handles obtained from `tl.load` can be sources).
If the cursor exceeds `_scratch_size` (default 1 MiB), a RuntimeError is
raised. The cursor must be reset to 0 at the start of every kernel
invocation; the current code gets this for free because KernelRunner
creates a new TLContext each time.
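The scratch allocation rule above amounts to a bump allocator with 16-byte alignment. The sketch below is an assumption-laden illustration: the class name `ScratchAllocator`, its attribute names, and the error message are hypothetical, and it assumes the base address itself is already 16-byte aligned.

```python
class ScratchAllocator:
    """Hypothetical bump allocator mirroring the D5 description."""

    ALIGN = 16

    def __init__(self, base, size=1 << 20):  # default size: 1 MiB
        self.base = base
        self.size = size
        self.cursor = 0  # a fresh instance starts at 0, like a new TLContext

    def alloc(self, nbytes):
        if self.base == 0:
            # No scratch region (e.g. command-list mode): address 0 means
            # the handle is not usable as a send/store source.
            return 0
        # Round the cursor up to the next 16-byte boundary.
        start = (self.cursor + self.ALIGN - 1) // self.ALIGN * self.ALIGN
        if start + nbytes > self.size:
            raise RuntimeError("scratch overflow")
        self.cursor = start + nbytes
        return self.base + start


scratch = ScratchAllocator(base=0x1000)
first = scratch.alloc(10)   # occupies offsets 0..9
second = scratch.alloc(4)   # bumped to offset 16
```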
### D6. Dispatch overhead — `PeCpuOverheadCmd(dispatch_cycles)`
`_emit_dispatch_overhead()` is called at the entry of every non-metadata
`tl.*` call and emits `PeCpuOverheadCmd(dispatch_cycles)` when
`dispatch_cycles > 0`. It models the cycle cost the PE_CPU spends on
command dispatch itself.
Defaults:
- Default of the `dispatch_cycles` parameter of `TLContext.__init__`:
1 cycle.
- TLContext created by `KernelRunner`: 0 cycles (greenlet mode accounts for
cycles separately, in the spirit of ADR-0020 D3).
### D7. Kernel registry (`triton_emu/registry.py`)
A separate `_kernels: dict[str, Callable]` holds the kernel name → function
mapping:
- `register_kernel(name, fn)`: ValueError on duplicate registration.
- `get_kernel(name)`: KeyError if not registered.
- `clear_registry()`: test-only.
On every call, `RuntimeContext.launch(kernel_name, kernel_fn, *args)`
idempotently overwrites `_kernels[kernel_name] = kernel_fn` (last call
wins). This is consistent with the launch behavior in ADR-0045 D8.
PE_CPU looks up the kernel function in the registry by `KernelRef.name` and
then runs it with KernelRunner.
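A minimal sketch of the registry semantics described above, assuming a plain module-level dict. The three public functions follow the names in D7; `launch_register` is a hypothetical helper that isolates the "last call wins" overwrite that `RuntimeContext.launch` is described as performing.

```python
from typing import Callable

_kernels: dict[str, Callable] = {}


def register_kernel(name: str, fn: Callable) -> None:
    # Strict path: refuse to silently replace an existing kernel.
    if name in _kernels:
        raise ValueError(f"kernel {name!r} already registered")
    _kernels[name] = fn


def get_kernel(name: str) -> Callable:
    return _kernels[name]  # raises KeyError when the name is unknown


def clear_registry() -> None:
    _kernels.clear()  # test-only reset


def launch_register(name: str, fn: Callable) -> None:
    # Hypothetical: the idempotent overwrite done on every launch.
    _kernels[name] = fn


def k1():
    return 1


def k2():
    return 2


register_kernel("demo", k1)
launch_register("demo", k2)  # overwrites without raising
```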
## Alternatives Considered
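### A1. Merge tl.* into ADR-0014 / ADR-0020
Rejected. ADR-0014 covers the PE pipeline (how sim_engine consumes
PeCommand), and ADR-0020 covers 2-pass execution (Phase 1 timing / Phase 2
data). `tl.*` is the API surface kernel authors face, so keeping it separate
is better for searchability and onboarding.
### A2. Deprecate command-list mode
Rejected (for now). Command-list mode is lightweight for simple unit tests
and kernel verification. It is kept as an entry point for inspecting
PeCommand sequences without a greenlet dependency. When greenlet-only
semantics (real data, Phase 2) are needed, they are entered explicitly
through the mode selection in D2.
### A3. Remove TensorHandle arithmetic operators
Rejected. The operators mimic the readability of real Triton kernel code
(e.g. `c = a + b`), and the thread-local active ctx pattern works cleanly.
The explicit `tl.add(a, b)` is also exposed in D3.5, so the functional call
can be used wherever the operators are confusing.
### A4. Expand softmax into an explicit sequence (max → exp → sum → div)
Partially adopted. `tl.softmax` accounts for timing in one step as a single
`MathCmd(op="softmax")` (D3.5). The real-data expansion is unrolled into the
canonical sequence by the Phase 2 DataExecutor. In other words, the timing
model is atomic and the data model is expanded; the two concerns are
separated deliberately.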
## Consequences
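- Every `tl.*` primitive a bench author encounters is classified and
defined in one ADR. Paired with the host-side surface of ADR-0045 D8
(torch.empty etc.), this completes the authoring guide for both "inside the
kernel" and "outside the kernel".
- The difference between command-list and greenlet modes is stated in D2,
so a new `tl.*` primitive is automatically compatible with both as long as
it follows the `_emit()` pattern.
- The thread-local active ctx pattern (D4) is justified at the ADR level,
which makes clear who is responsible for the reset when multiple PEs run in
the same thread (`_switch_kernel` restores the active ctx on cmd dispatch;
this is the contract of KernelRunner.run).
- The diagnostic-only isolation of `tl.recv_no_consume` (D3.10) is fixed in
the ADR, so the layer that prevents accidental use in production kernels is
clear.
- The registry (D7) is split out as its own D item, so the behavior on
kernel name collisions and dynamic re-registration is specified.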
@@ -0,0 +1,243 @@
# ADR-0047: AHBM CCL Backend — `torch.distributed`-compat shim
## Status
Accepted (2026-05-22).
This ADR specifies what `AhbmCCLBackend` + `DistributedContext` in
`runtime_api/distributed.py` actually install at the
`torch.distributed.init_process_group(backend="ahbm")` entry point, and with
what semantics they implement `all_reduce`/`barrier`/`get_rank` and the
rest. ADR-0023 D11 mentions the intent of "torch.distributed compatibility",
but **the behavioral model of the backend itself** was not captured at the
ADR level.
## First action (the very first thing done)
`RuntimeContext.__post_init__` automatically creates a
`DistributedContext()` instance and attaches it to `self.distributed`. The
first things done at that point:
1. Initialize `self._backend: AhbmCCLBackend | None = None` (not yet
initialized).
2. Initialize the greenlet-local rank registry with
`self._rank_by_greenlet: dict = {}` (ADR-0024 D2).
3. The caller (RuntimeContext) plants a back-reference with
`dc._ctx_ref = self`, so that `init_process_group` can later reach
`ctx.engine` / `ctx.spec` / `ctx.launch`.
In short, **the first job of DistributedContext is "to attach itself to the
RuntimeContext together with a back-reference and leave the backend slot
empty"**. The actual backend installation (IPCQ install, world_size
resolution, algorithm module load) happens only when user code calls
`torch.distributed.init_process_group(backend="ahbm")`.
The first things `init_process_group` does at that point:
1. If `backend != "ahbm"`, immediately raise
`ValueError("Unsupported backend ...")`.
2. If `getattr(self, "_ctx_ref", None)` is None, raise
`RuntimeError("DistributedContext not bound to a RuntimeContext")`.
3. `self._backend = AhbmCCLBackend(torch_ctx=ctx)`: inside this constructor,
the ccl.yaml load, algorithm module import, world_size resolution, SFR
configuration, and IPCQ install all happen.
4. `self._backend._dist_ctx = self`: lets the backend reach back into
`_rank_by_greenlet`.
## Context
The purpose of `AhbmCCLBackend` is to let PyTorch DDP collective calls
(`init_process_group`, `all_reduce`, etc.) be used as-is, so that bench code
looks the same as a "real DDP training script" (aligned with the launcher
model of ADR-0024 + ADR-0027).
This backend is responsible for:
- **Installing the IPCQ neighbor table once** at `init_process_group` time
(similar to real NCCL communicator creation).
- On `all_reduce(tensor, op="sum")`, **launching the kernel function of the
configured algorithm via `ctx.launch(...)`**.
- Answering `get_world_size` / `get_rank` consistently from the
greenlet-local rank registry and ccl.yaml/topology.
ADR-0023 D10 (IPCQ install plan) and ADR-0024 (SIP launcher) cover parts of
this, but **the scope of responsibility and decision order of
`AhbmCCLBackend` itself** is not stated anywhere. This ADR fills that gap.
## Decision
### D1. The backend is created only at `init_process_group(backend="ahbm")` time
`DistributedContext` starts with `_backend = None` at `__init__`.
The backend object does not exist until the user calls
`dist.init_process_group(backend="ahbm")`. If any other API
(`is_initialized`, `get_world_size`, `all_reduce`, `barrier`) is called
while the backend is still None, it raises
`RuntimeError("Default process group has not been initialized...")`
(the `_ensure_initialized` helper).
`backend != "ahbm"` is an immediate `ValueError`. Other backend names (nccl,
gloo, etc.) are not recognized.
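The lazy-creation rule in D1 can be sketched as below. This is a simplified illustration under stated assumptions: `DistributedContextSketch` is a hypothetical class, the backend is a placeholder dict rather than a real `AhbmCCLBackend`, and the world size of 4 is made up.

```python
class DistributedContextSketch:
    """Hypothetical, simplified stand-in for DistributedContext."""

    def __init__(self):
        self._backend = None  # empty slot until init_process_group

    def _ensure_initialized(self):
        if self._backend is None:
            raise RuntimeError(
                "Default process group has not been initialized"
            )
        return self._backend

    def init_process_group(self, backend="ahbm", world_size=None, rank=None):
        # world_size / rank are accepted but ignored in this sketch.
        if backend != "ahbm":
            raise ValueError(f"Unsupported backend {backend!r}")
        # Placeholder for the real backend object (assumption).
        self._backend = {"name": "ahbm", "world_size": 4}

    def get_world_size(self):
        return self._ensure_initialized()["world_size"]


dc = DistributedContextSketch()
try:
    dc.get_world_size()
    early_error = None
except RuntimeError as exc:
    early_error = str(exc)
dc.init_process_group(backend="ahbm")
size_after_init = dc.get_world_size()
```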
### D2. world_size resolution priority: algorithm > defaults > topology
Decision order of `AhbmCCLBackend._resolve_world_size` (ADR-0024 D1):
1. If the algorithm entry in `ccl.yaml` has `world_size`, use that value.
2. Otherwise, if `defaults.world_size` exists, use that value.
3. If neither exists, use `spec.system.sips.count` (= the number of SIPs in
the topology).
The default meaning is **rank = SIP** (ADR-0024). Cube/PE-level parallelism
is expressed by DPPolicy within each rank and does not affect world_size. An
explicit world_size override in `ccl.yaml` is honored as-is for the legacy
"rank = flat PE index" test path.
The user arguments in `init_process_group(world_size=..., rank=...)` are
**accepted but ignored** (they carry the same meaning as the `RANK` /
`WORLD_SIZE` env vars of real PyTorch).
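The three-level precedence can be written as a small function. This is a sketch, assuming the algorithm entry and the defaults section are available as plain dicts; the function and parameter names are hypothetical and do not reflect the real `_resolve_world_size` signature.

```python
def resolve_world_size(algo_cfg: dict, defaults: dict, n_sips: int) -> int:
    """Precedence: algorithm entry > defaults > topology SIP count."""
    if "world_size" in algo_cfg:
        return algo_cfg["world_size"]
    if "world_size" in defaults:
        return defaults["world_size"]
    return n_sips


# Hypothetical configurations exercising each branch.
from_algo = resolve_world_size({"world_size": 8}, {"world_size": 4}, 2)
from_defaults = resolve_world_size({}, {"world_size": 4}, 2)
from_topology = resolve_world_size({}, {}, 2)
```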
### D3. The five installation steps `init_process_group` performs immediately
The following run sequentially inside `AhbmCCLBackend.__init__`:
1. **Load ccl.yaml**: `kernbench.ccl.install.load_ccl_config()`, then
`resolve_algorithm_config(_cfg_all)`, yielding the merged config of
`defaults.algorithm` (or of the algorithm the user specified).
2. **Import the algorithm module**:
`importlib.import_module(self._merged["module"])`. The module must expose a
`kernel` function, a `kernel_args(world_size, n_elem, cube_w, cube_h)`
helper, and an optional `TOPO_NAME_TO_KIND` mapping.
3. **Resolve world_size** (D2).
4. **Collect topology metadata**: from `spec`, read `n_sips`, `sip_topo`
(default `ring_1d`), `cube_w`/`cube_h`, and `sips.w`/`sips.h`. If the SIP
topology is not ring_1d, `_sip_topo_w/h` is derived from the explicit
`w`/`h` or from a square root (guaranteeing `w*h == n_sips`). A mismatch
raises `ValueError`.
5. **Install SFR + IPCQ**: call
`kernbench.ccl.sfr_config.configure_sfr_intercube_multisip(engine, spec, self._merged)`.
This function pushes the IPCQ neighbor table to pe0 of every SIP/cube (the
counterpart of the one-time setup of a real NCCL communicator).
If this order changes (for example, if the algorithm module load fails
before SFR), a partially initialized state can arise. D3 is therefore
treated as one atomic five-step sequence: on failure the backend stays
uninstalled.
### D4. Greenlet-local rank registration (ADR-0024 D2)
`DistributedContext._rank_by_greenlet: dict[greenlet, int]` maps each
spawned worker greenlet to a rank. When the bench launcher (e.g.
`torch.multiprocessing.spawn`) starts a worker, it calls
`dc._bind_rank(g, rank)` to register it.
`get_rank()` looks up the greenlet returned by `getcurrent()`. An
unregistered greenlet falls back to 0, which keeps single-driver / test
compatibility.
During `all_reduce` the backend obtains the rank of the current greenlet
through `_dist_ctx._rank_by_greenlet` (D5).
### D5. `all_reduce(tensor, op="sum")` behavior
Validation:
- `op != "sum"` → `NotImplementedError`. Current kernels implement only add
reduction.
- `tensor._handle is None` → `RuntimeError("not deployed")`.
- Empty `tensor._handle.shards` → `RuntimeError("no shards")`.
Preparation:
- `n_elem = shards[0].nbytes // tensor.itemsize`: element count of a single
shard.
- `kernel_fn = self._algo_module.kernel`: entry function of the algorithm
module imported in D3.
- Effective cube dims: if the first SIP has exactly one cube, treat it as
the scalar case (1,1); otherwise use the `cube_w`/`cube_h` of the topology.
This naturally absorbs the case where TP uses only some of the cubes.
- `kernel_args = self._algo_module.kernel_args(world_size, n_elem, cube_w,
cube_h)`: the algorithm decides the argument set passed to its own kernel.
Dispatch:
- Look up the rank of the current greenlet with
`_rank_by_greenlet.get(g, 0)`.
- Append `extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`.
- `pending = self.ctx.launch(algorithm_name, kernel_fn, tensor, *kernel_args,
*extra_args, _defer_wait=True)`: `_defer_wait=True` delegates the collective
drain to the main scheduler (ADR-0027 D0.4).
Drain:
- If a parent greenlet is alive (multi-greenlet mode), enqueue into
`_pending_collective_handles` and switch to the parent. The main scheduler
drains in bulk after every rank has launched.
- In single-driver mode, drain inline immediately with
`for h, _sip_id, meta in pending: self.ctx.wait(h, _meta=meta)`.
### D6. `barrier()` is a no-op (single-driver model)
kernbench handles all ranks as greenlets inside a single Python process. No
situation requires inter-process synchronization, so `barrier()` is callable
but performs no actual synchronization. It is kept for API compatibility
with real PyTorch DDP (so that callers do not get a NotImplementedError).
If a multi-process kernbench is introduced later (e.g. an independent SimPy
event loop per process), a new ADR superseding D6 will be needed.
### D7. Semantics of `get_rank` / `get_world_size` / `get_backend`
- `get_rank()` (D4): the bound rank of the current greenlet. 0 if
unregistered.
- `get_world_size()` (D2): the world_size the backend resolved in D3.
- `get_backend()`: always the string `"ahbm"`. If the backend object does
not exist, `_ensure_initialized` raises RuntimeError.
Difference from real PyTorch:
- Real PyTorch `get_rank()` is a process-global value, whereas in kernbench
it is greenlet-local. Called inside a spawned worker it returns the rank;
called from the main thread it returns 0. Bench authors should expect a
meaningful rank only inside the worker function.
### D8. Supported API surface (final)
API exposed by `DistributedContext`:
- `init_process_group(backend="ahbm", world_size=None, rank=None, **kwargs)`
- `is_initialized() -> bool`
- `get_world_size() -> int`
- `get_rank() -> int`
- `get_backend() -> str`
- `all_reduce(tensor, op="sum") -> None`
- `barrier() -> None`
- (internal) `_bind_rank(g, rank)`
Other PyTorch distributed APIs (broadcast, reduce, all_gather, gather,
scatter, send/recv, etc.) are **not implemented yet**. At the kernel level
they can be expressed directly with `tl.send`/`tl.recv` (ADR-0046 D3.10),
but they are not exposed on the dist.* surface. When an additional
collective is needed, D8 is extended by adding a separate algorithm module
plus a matching `DistributedContext` method.
## Alternatives Considered
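### A1. Create the backend immediately in `RuntimeContext.__init__`
Rejected. If ccl.yaml is missing or the algorithm module cannot be imported,
RuntimeContext creation itself would fail even for benches that never use
distributed features. "Install only at call time" (D1) is the correct lazy
semantics.
### A2. Always derive world_size from the topology (no override)
Rejected. The "explicit override" path of ADR-0024 D1 is in use by legacy
tests. It is kept for diagnostic scenarios that need to define PE-level
ranks separately within one SIP.
### A3. Handle `op != "sum"` with a silent fallback
Rejected. If the user intended `op="prod"` / `"max"` / `"avg"` and sum ran
silently, validating results would be very hard. An explicit
`NotImplementedError` is safer.
### A4. Implement `barrier` as a SimPy event
Rejected (for now). Cross-process synchronization has no meaning in the
single-driver model, so a no-op is semantically accurate. A SimPy fake
barrier would only add code complexity without adding meaning. Re-evaluate
when multi-process kernbench is introduced.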
## Consequences
- The five-step installation (D3) of
`torch.distributed.init_process_group(backend="ahbm")` is fixed at the ADR
level, so it is clear where future collective algorithms should hook in.
- The priority in D2 (algorithm > defaults > topology) is stated, so the
impact of a ccl.yaml change can be gauged quickly.
- The barrier no-op decision in D6 is fixed at the ADR level, so it is clear
that introducing multi-process kernbench requires a superseding ADR.
- The list of unsupported APIs in D8 is stated, giving a clear basis for
refusal when a user tries to call `dist.broadcast(...)`.
@@ -0,0 +1,262 @@
# ADR-0048: Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
## Status
Accepted (2026-05-22).
This ADR specifies the free-list algorithm, page alignment, and coalescing
rules used by `_FreeList` / `PEMemAllocator` in
`policy/address/allocator.py` and by `VirtualAllocator` in
`va_allocator.py`. ADR-0001 (PhysAddr layout) and ADR-0011 (PA/VA/LA model)
define the address scheme, but there was no separate ADR for **the
allocation algorithm**.
## First action (the very first thing done)
### `_FreeList(capacity)`
On construction it initializes `self._capacity = capacity`,
`self._used = 0`, and `self._free = [(0, capacity)]`. Its first job is **to
set up the whole region as a single free block**: the single tuple
`(offset=0, size=capacity)` is the only element of the free list.
### `PEMemAllocator(sip_id, die_id, pe_id, cfg)`
On construction it creates two `_FreeList` instances:
- `self._hbm = _FreeList(cfg.hbm_slice_bytes)`: sized to the HBM slice this
PE owns (`hbm_bytes_per_cube // hbm_slices_per_cube`).
- `self._tcm = _FreeList(cfg.tcm_allocatable_bytes)`: sized to
`tcm_bytes_per_pe - tcm_scheduler_reserved_bytes` (the scheduler
reservation is carved out in advance).
So the first job of PEMemAllocator is **to set up this PE's HBM slice and
its user TCM region each as a single free block**.
### `VirtualAllocator(va_base, va_size, page_size=2*1024*1024)`
On construction: `self._va_base = va_base`, `self._va_size = va_size`,
`self._page_size = page_size`, `self._used = 0`, `self._free = [(va_base,
va_size)]`. Its first job is **to set up the range from the VA base through
its size as a single block and record the page_size**.
## Context
`runtime_api/context.py::_ensure_allocators` builds the allocator set in the
following steps:
1. Read `hbm_total_gb_per_cube`, `hbm_slices_per_cube`, `tcm_size_mb`, the
per-target_device SIP range, and so on from the spec.
2. Pack all parameters into a frozen `AddressConfig`.
3. For every combination of target SIP range × cube × PE, create one
`PEMemAllocator(sip, cube, pe, cfg)` instance.
4. Create one `VirtualAllocator(va_base=0x1_0000_0000, va_size=64 GiB,
page_size=pe_mmu.page_size)`.
Responsibilities of the allocators:
- **PEMemAllocator**: PA-space allocation of the PE-local HBM slice / TCM
(including the PhysAddr encoding).
- **VirtualAllocator**: page-aligned allocation of the device-wide VA space.
`RuntimeContext._create_tensor` then pushes the VA → PA mapping to the
fabric as `MmuMapMsg`.
These algorithms:
- are simple **first-fit**.
- keep the free block list **sorted by offset (sorted by start)**.
- **coalesce with the adjacent blocks on both sides** on `free()`.
The rationale for these decisions is written nowhere, so a reference is
needed to answer future questions such as "why not best-fit?", "why not a
buddy allocator?", and "why is a partial-overlap free silently allowed?".
This ADR provides that reference.
## Decision
### D1. `_FreeList`: offset-based first-fit + coalescing
`policy/address/allocator.py::_FreeList`:
- Internal representation: `list[tuple[int, int]]` =
`[(start_offset, size), ...]`, a list of free blocks sorted by start offset.
- `alloc(nbytes)`:
1. Walk the free list from the front (first-fit).
2. Carve the front portion out of the first block with `size >= nbytes`.
3. On an exact match remove the whole block; otherwise shrink it to
`(start+nbytes, size-nbytes)`.
4. `_used += nbytes`; return the carved `start`.
5. If no block fits, raise `AllocationError("overflow ... largest free block
...")`.
- `free(offset, nbytes)`:
1. `_used -= nbytes`.
2. Find the insertion position with `bisect_left(self._free, (offset,))`.
3. If adjacent to the previous block (`prev_start + prev_size == offset`),
absorb it.
4. If adjacent to the next block (`offset+nbytes == next_start`), absorb it.
5. Insert the coalesced range at its sorted position.
This algorithm is weaker against fragmentation than best-fit / buddy, but
the design assumption is that it suffices for this simulator's workload
(deploy/free patterns are almost stack-like). If the workload changes, D1 is
a candidate for being superseded.
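The `alloc` / `free` steps listed above can be sketched as a self-contained class. This is an illustrative reimplementation written from the description, not the project's `_FreeList`: the class name `FreeList`, the exact error text, and the local `AllocationError` are assumptions. Like D2 below, it does not validate the ranges passed to `free`.

```python
from bisect import bisect_left


class AllocationError(Exception):
    pass


class FreeList:
    """First-fit free list over [0, capacity) with two-sided coalescing."""

    def __init__(self, capacity):
        self._free = [(0, capacity)]  # sorted by start offset
        self._used = 0

    def alloc(self, nbytes):
        for i, (start, size) in enumerate(self._free):
            if size >= nbytes:  # first block that fits
                if size == nbytes:
                    del self._free[i]
                else:
                    self._free[i] = (start + nbytes, size - nbytes)
                self._used += nbytes
                return start
        largest = max((s for _, s in self._free), default=0)
        raise AllocationError(
            f"overflow: need {nbytes}, largest free block {largest}"
        )

    def free(self, offset, nbytes):
        self._used -= nbytes
        i = bisect_left(self._free, (offset,))
        start, size = offset, nbytes
        # Absorb the previous block if it ends exactly where we start.
        if i > 0:
            p_start, p_size = self._free[i - 1]
            if p_start + p_size == start:
                start, size = p_start, p_size + size
                i -= 1
                del self._free[i]
        # Absorb the next block if it starts exactly where we end.
        if i < len(self._free):
            n_start, n_size = self._free[i]
            if start + size == n_start:
                size += n_size
                del self._free[i]
        self._free.insert(i, (start, size))


fl = FreeList(100)
a, b, c = fl.alloc(10), fl.alloc(20), fl.alloc(30)
fl.free(b, 20)
after_hole = list(fl._free)   # a hole at 10..29 plus the tail block
fl.free(a, 10)
fl.free(c, 30)
after_all = list(fl._free)    # everything merged back into one block
```

Freeing in an order that leaves a hole first shows both merge directions: `a` merges forward into the hole, and `c` merges with the block before it and the tail after it.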
### D2. Partial-overlap frees are **not checked**
`_FreeList.free(offset, nbytes)` trusts the caller to pass the exact
(offset, nbytes). It does not verify:
- that the range was actually allocated.
- that the range does not overlap another allocated region.
Reason: in the simulator context the caller always stores the return value
of `alloc()` and passes it to `free()` unchanged; it is not external user
input. Adding safety checks would cost O(N) per free and affect simulation
wall-clock time.
If this trust model breaks (e.g. a code path is introduced in which two
tensors point to the same PA), it must be revisited at the ADR level
immediately.
### D3. `PEMemAllocator`: separate HBM and TCM channels
`PEMemAllocator(sip_id, die_id, pe_id, cfg)` holds two `_FreeList`
instances:
- `_hbm`: size `cfg.hbm_slice_bytes`.
- `_tcm`: size `cfg.tcm_allocatable_bytes` (= `tcm_bytes_per_pe -
tcm_scheduler_reserved_bytes`).
`alloc_hbm(nbytes) -> PhysAddr`:
- Obtain the offset with `_hbm.alloc(nbytes)`.
- Encode the PA with `PhysAddr.pe_hbm_addr(sip_id, die_id, pe_id,
pe_local_hbm_offset=offset, slice_size_bytes=cfg.hbm_slice_bytes)`.
- On failure, `AllocationError("HBM overflow ...")`.
`free_hbm(pa, nbytes)`:
- Recover the PE-local offset as
`pa.hbm_offset - pe_id * cfg.hbm_slice_bytes`.
- `_hbm.free(offset, nbytes)`.
`alloc_tcm(nbytes) -> PhysAddr`: analogous, encoded with
`PhysAddr.pe_tcm_addr`.
`free_tcm(pa, nbytes)`: uses `pa.sub_offset` directly (for TCM the PE-local
offset is the sub_offset).
The allocator is unaware of the scheduler-reserved TCM region
(`cfg.tcm_scheduler_reserved_bytes`); it is already subtracted from the
capacity of `_tcm`. This is consistent with the PE_SCHEDULER internal buffer
reservation in ADR-0014.
### D4. `VirtualAllocator`: page-aligned first-fit + coalescing
`policy/address/va_allocator.py::VirtualAllocator`:
- Internal representation: the same sorted `list[tuple[int, int]]` as
`_FreeList`. Initially `[(va_base, va_size)]`.
- `_align_up(nbytes) = ceil(nbytes / page_size) * page_size`.
- `alloc(nbytes) -> int`:
1. `aligned = _align_up(nbytes)`.
2. First-fit search for a block with `size >= aligned`.
3. Carve `aligned` bytes from the front of the block. Remove it on an exact
match.
4. `_used += aligned`. Return the block's `start` (= the aligned VA).
5. On failure, `VaAllocationError`.
- `free(va, nbytes)`: frees in units of `_align_up(nbytes)`. Same coalesce
algorithm as `_FreeList`.
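The page rounding in `_align_up` is plain integer arithmetic. The sketch below uses a hypothetical free function `align_up` and shows, under the assumption of a 2 MiB page and the `va_base` quoted in this ADR, where two back-to-back allocations from a fresh allocator would land.

```python
def align_up(nbytes: int, page_size: int) -> int:
    """ceil(nbytes / page_size) * page_size using integer math only."""
    return -(-nbytes // page_size) * page_size


PAGE = 2 * 1024 * 1024        # 2 MiB, the standalone default
VA_BASE = 0x1_0000_0000       # va_base used by _ensure_allocators

# A fresh allocator carves from the front of its single free block, so
# consecutive allocations are laid out page by page from VA_BASE.
first_va = VA_BASE
second_va = first_va + align_up(1, PAGE)             # 1 byte costs a page
third_va = second_va + align_up(3 * 1024 * 1024, PAGE)  # 3 MiB costs 2 pages
```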
The actual value of `page_size` has different defaults in two places:
- Parameter default of `VirtualAllocator.__init__`: `2 MiB`. Tests that call
it directly get this value.
- When `RuntimeContext._ensure_allocators` instantiates it:
`pe_mmu.attrs.get("page_size", 4096)`, i.e. the value of
`pe_mmu.attrs.page_size` in `topology.yaml` if present, otherwise the 4 KiB
fallback.
Why the two defaults differ: the standalone default of VirtualAllocator
matches the PE_MMU stopgap default of ADR-0039 (2 MiB), which makes direct
tests natural, while the 4 KiB context fallback is a safe minimum page when
the topology does not set one. The real usage path is always the latter
(because `_ensure_allocators` does the instantiation), and when `page_size`
is set in `topology.yaml` that value flows consistently to both sides (MMU +
VA allocator).
If this agreement breaks (e.g. VirtualAllocator is instantiated with a
page_size different from PE_MMU), MMU `map()` falls into sub-page region
mode (ADR-0039 D3).
Default VA range: `va_base = 0x1_0000_0000` (= 4 GiB), `va_size = 64 GiB`.
These values are hard-coded in `_ensure_allocators` and carry no direct
meaning in the VA model of ADR-0011; they simply reserve a device-wide
address space large enough not to collide with host code.
### D5. Lifecycle of allocator instances
- `RuntimeContext._ensure_allocators` is called lazily (at the first call of
`_create_tensor`).
- Once created, the allocator dict (`self._allocators`) is reused for the
lifetime of the RuntimeContext. A second deploy in the same process does not
create new objects.
- `RuntimeContext.cleanup()` calls `_free_tensor()` on every living tensor →
MMU unmap + `va_allocator.free` + `pemem_allocator.free_hbm` restore the
free lists to their original state. The next RuntimeContext starts again
from the initial state.
Because allocator state is not shared across RuntimeContexts, deploy →
cleanup → deploy is deterministic across consecutive runs within a single
process.
### D6. Allocator failures raise (no silent OOM)
Both `_FreeList.alloc` and `VirtualAllocator.alloc` raise `AllocationError`
/ `VaAllocationError` when no sufficiently large free block exists. The
message includes "required size + largest available block", so
fragmentation can be told apart from a true OOM.
A silent fallback (e.g. allocating only as much as the largest block) is
strictly forbidden: if a partially allocated tensor enters the SimPy stage,
routing and DMA see wrong PAs and simulation accuracy breaks.
### D7. 1:1 correspondence between address spaces and allocators
Physical address space separation is expressed by the PhysAddr sub-unit
(ADR-0001 D2.3), and each sub-unit has its own allocator instance:
- HBM slice → `PEMemAllocator._hbm`.
- PE TCM → `PEMemAllocator._tcm`.
- (currently unused) M_CPU local memory, CUBE SRAM → a separate allocator
would be needed. The current implementation still handles them as IPCQ-only
slots (ADR-0023 D9.7) and does not share PA space, so there is no separate
free-list.
When a cube-level SRAM allocator becomes necessary, a
`_FreeList(cfg.sram_bytes_per_cube)` instance is added per cube
(`cfg.sram_bytes_per_cube` is already defined in `AddressConfig`, so the
data model is ready).
## Alternatives Considered
### A1. best-fit / buddy allocator
Rejected (for now). The workload's alloc/free pattern is stack-like (deploy
order = free order), so first-fit + coalescing keeps fragmentation
sufficiently under control. If long-running fragmentation is observed in
LLM kernel sweeps, a separate ADR will replace it with buddy.
### A2. Add partial-overlap free validation
Rejected. The trust model of D2 plus the O(N) check cost. An option to
enable it in a debug mode (a `KERNBENCH_DEBUG` env var or similar) is
possible as follow-up work.
### A3. Unified allocator for VA and PA
Rejected. The VA space (64 GiB device-wide) and the PA space (~6 GiB per
slice) differ in kind. VA is the host kernel's view; PA is the device
sub-unit's view. The allocators are kept separate to match the spirit of the
ADR-0011 VA model (the MMU maps between the two).
### A4. Multi-tier page_size support (large page + small page)
Rejected (for now). A single page_size (currently 2 MiB) fits the tensor
granularity of LLM kernels (several MiB to several GiB), and small mappings
are absorbed by the sub-page regions of ADR-0039 D3 when needed. Multi-tier
pages require extending the MMU model itself, so they are a candidate for a
separate ADR.
## Consequences
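- The allocator algorithms are fixed at the ADR level (D1, D3, D4), so when
a fragmentation issue appears in a new simulation scenario it is clear that
"first-fit + coalesce is used here".
- The trust model of D2 is stated, so if a path that takes alloc/free
directly from user input is introduced, the need to supersede this ADR is
recognized early.
- The 1:1 sub-unit-to-allocator correspondence of D7 is stated, so it is
clear where to add a free-list when separate M_CPU/SRAM regions become
necessary.
- D4 states that the page_size of `VirtualAllocator` must match the PE_MMU
setting, so the interaction with the ADR-0039 stopgap behavior can be gauged
quickly when the page_size in topology.yaml changes.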
@@ -0,0 +1,231 @@
# ADR-0049: `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
## Status
Accepted (2026-05-22).
This ADR specifies the traffic-pattern catalog exposed by `run_probe(...)`
in `probes/probe.py`, the formula vs actual comparison, and the meaning of
invariant checks such as monotonicity and D2H≥H2D. ADR-0010 (CLI surface)
enumerates the `kernbench probe` subcommand, but **what probe actually
measures** and **which invariants it judges as PASS/FAIL** were not captured
at the ADR level.
## First action (the very first thing done)
The first four tasks of `run_probe(topology_path, case_filter=None)`:
1. Compute the absolute path with
`Path(topology_path).expanduser().resolve()`.
2. `load_topology(path)` yields a `TopologyGraph` instance (graph + spec).
3. `_build_edge_map(graph)` yields a fast `{(src, dst): Edge}` lookup table.
4. Instantiate `AddressResolver(graph)` + `PathRouter(graph)`.
It then sets `nbytes = 32768` (= 32 KiB, the reference data size of the
summary table) and
`show_all = (case_filter is None or case_filter == "all")`.
In short, **the first job of probe is "to load the topology once, prepare
the edge map / resolver / router, and fix 32 KiB as the standard measurement
size"**. After that, the cases of the three categories H2D → D2H → PE DMA
each run in a separate `GraphEngine` instance (blocking cross-talk between
cases).
## Context
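`kernbench probe` is a verification tool introduced with the following
intents:
- **Ground truth for manual analysis**: when a latency from a real
simulation (`kernbench run --bench ...`) looks abnormal, obtain the correct
answer for a simple traffic pattern separately and compare.
- **Formula vs actual comparison**: check that the analytical model (wire
latency + overhead + drain) agrees with the simulation result (`total_ns`).
A disagreement hints at which model-simplification assumption (ADR-0033) is
missing something.
- **Monotonicity check**: automatic verification of the invariant that
latency must increase monotonically with hop count.
- **Utilization sweep**: a table of BW utilization per data size (4 KiB ~
1 MiB).
If the behavior of this tool is not specified at the ADR level:
- Someone adding a different kind of traffic pattern (e.g. MCpuDma, IPCQ)
will have trouble following the table format / measurement units of the
existing categories consistently.
- It is ambiguous what monotonicity is checked against (hop count? cube
distance? wire length?).
- The meaning of the 32 KiB reference size and of the
`[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]` sweep can only be found by
grepping the code.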
## Decision
### D1. Three case categories: H2D / D2H / PE DMA
Each category has a distinct data path in the topology and is printed as its
own summary table + sweep table + route detail block.
- **H2D (Host→Device Write)**: `MemoryWriteMsg(dst_sip=0, dst_cube,
dst_pe=0, pattern="zero")` flows along the
`pcie_ep → io_cpu → m_cpu → hbm_ctrl` path. The hop count grows with the
cube index:
- h2d-1hop: cube=0, hops=1
- h2d-2hop: cube=4, hops=2
- h2d-3hop: cube=8, hops=3
- h2d-4hop: cube=12, hops=4
- **D2H (Device→Host Read)**: `MemoryReadMsg(src_sip=0, src_cube, src_pe=0)`.
Latency is the sum of the forward command path + reverse data path. Same
four hop categories.
- **PE DMA (PE-initiated)**: `PeDmaMsg(src_sip, src_cube, src_pe, dst_pa)`.
Five cases that vary the cube/PE position:
- pe-local-hbm: same cube, same PE
- pe-same-half-hbm: same cube, different PE (PE 1)
- pe-cross-half-hbm: same cube, far PE (PE 4)
- pe-cross-cube-hbm-best: adjacent cube (cube 1)
- pe-cross-cube-hbm-worst: diagonal far cube (cube 15)
The indices 4/8/12 (H2D) and 1/4/15 (PE DMA) are meaningful because of the
distance definition on a 4x4 cube mesh (sip.cube_mesh.w=4, h=4); if the
cube_mesh size changes later, these values must be updated with it.
### D2. Standard measurement size: `nbytes = 32768` (32 KiB)
The summary table of every case shows the result of one run with
`nbytes=32768`. Why 32 KiB:
- A moderate size at which neither DMA overhead nor BW drain dominates.
- Comparable to the single-transfer unit of many sub-units (TCM, register
file).
How utilization changes with size is shown by the separate sweep table (D3).
### D3. Utilization sweep — `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]`
`SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]`, `SWEEP_LABELS =
["4KB", "16KB", "64KB", "256KB", "1MB"]`. The following formulas are applied
for each size:
```
drain = nbytes / bottleneck_bw
total = overhead + wire + drain
eff_bw = nbytes / total
util% = eff_bw / bottleneck_bw × 100
```
If `bn_bw is None or <= 0`, the column is printed as 0.0 %. Purpose: to show
in one table the pattern that, as hop count grows, small transfers become
overhead-bound while large transfers become drain-bound.
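The sweep formulas above can be worked through numerically. The sketch below is illustrative only: the function name `utilization`, the overhead, wire, and bandwidth values are hypothetical, and it assumes bandwidth is given in GB/s with 1 GB/s treated as 1 byte/ns so that all times come out in nanoseconds.

```python
SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]


def utilization(nbytes, overhead_ns, wire_ns, bottleneck_bw_gbs):
    """Return util% from the D3 formulas (0.0 when no bottleneck BW)."""
    if bottleneck_bw_gbs is None or bottleneck_bw_gbs <= 0:
        return 0.0
    drain_ns = nbytes / bottleneck_bw_gbs       # assumes 1 GB/s == 1 B/ns
    total_ns = overhead_ns + wire_ns + drain_ns
    eff_bw = nbytes / total_ns
    return eff_bw / bottleneck_bw_gbs * 100.0


# Hypothetical path: 1000 ns node overhead, 24 ns wire, 32 GB/s bottleneck.
at_32k = utilization(32768, 1000, 24, 32)
sweep = [utilization(n, 1000, 24, 32) for n in SWEEP_SIZES]
```

With these made-up numbers the 32 KiB transfer drains in 1024 ns, exactly matching the 1024 ns of fixed cost, which is why its utilization lands at 50%. Smaller sizes fall below that (overhead-bound) and larger sizes rise above it (drain-bound).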
### D4. Measured items: actual / formula / breakdown
Columns shown in each case row:
- `Actual` (total_ns): `trace["total_ns"]` from the SimPy run.
- `Ovhd`: sum of `node.attrs["overhead_ns"]` over every node on the path
(formula breakdown).
- `Drain`: `nbytes / min(edge.bw_gbs over path)` (formula).
- `Wire`: `Σ edge.distance_mm * (ns_per_mm from spec)`.
- `Ovhd%` / `Drain%`: share of Actual taken by Ovhd/Drain (the Wire term of
the formula is usually very small and is not shown).
- `Eff.BW`: `nbytes / total_ns` (measured BW).
- `BN.BW`: bottleneck bandwidth (formula). The minimum BW over all edges on
the path. "-" if no edge has a BW.
- `Util%`: `Eff.BW / BN.BW × 100`. 100% means the single-stream BW upper
bound is reached.
A large gap between the formula sum (`wire + ovhd + drain`) and actual
signals a factor that the model simplification does not capture (a check on
the ADR-0033 assumptions).
### D5. Automatic invariant checks: PASS/FAIL
The following invariants are checked automatically and printed as
`[v] PASS` / `[x] FAIL`:
- **H2D / D2H monotonic increase**: actual latency must increase
monotonically with hop count. `all(lats[i] < lats[i+1] for ...)`.
- **D2H ≥ H2D**: at the same hop index, D2H ≥ H2D (because D2H has two legs,
forward command + reverse data). `all(d2h[i].total >= h2d[i].total)`.
- **PE DMA best < worst**: cross-cube best (adjacent) latency < cross-cube
worst (diagonal) latency.
- **PE DMA local vs remote**: prints a comparison of local BN BW vs remote
BN BW (informational, not PASS/FAIL).
When a check FAILs, it is printed clearly on one line so that a person
notices the model/topology regression immediately.
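The first two invariants reduce to short predicates over per-hop latency lists. The sketch below is a hedged illustration: the helper names and the latency numbers are hypothetical, and only the `[v]` / `[x]` prefixes are taken from this ADR (the rest of the line format is an assumption).

```python
def strictly_increasing(lats):
    """Monotonic increase: each hop count is slower than the previous."""
    return all(a < b for a, b in zip(lats, lats[1:]))


def d2h_not_faster(h2d, d2h):
    """D2H >= H2D at every shared hop index."""
    return all(d >= h for h, d in zip(h2d, d2h))


def verdict(ok, label):
    # Only the bracketed prefix is specified; the label text is made up.
    return f"{'[v] PASS' if ok else '[x] FAIL'} {label}"


# Hypothetical latencies (ns) for the 1..4 hop cases.
h2d_ns = [1200.0, 1350.0, 1500.0, 1650.0]
d2h_ns = [1400.0, 1600.0, 1800.0, 2000.0]

lines = [
    verdict(strictly_increasing(h2d_ns), "H2D monotonic"),
    verdict(strictly_increasing(d2h_ns), "D2H monotonic"),
    verdict(d2h_not_faster(h2d_ns, d2h_ns), "D2H >= H2D"),
]
```

When a case filter leaves a single data point, `zip(lats, lats[1:])` is empty and the predicate is vacuously true, so a caller would typically skip printing the check in that situation.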
### D6. Route detail — per-hop timestamp trace
After the summary and sweep tables, the path of each case and its per-hop
cumulative time (`_hop_timestamps`) are printed in a separate section:
- H2D: leg1 (`pcie_ep → io_cpu`) + leg2 (`io_cpu → m_cpu`) + leg3
(`m_cpu → hbm_ctrl`) + per-hop trace.
- D2H: forward (cmd, no data) and reverse (data) traces shown separately.
- PE DMA: `pe_dma → router → hbm_ctrl` path + per-hop trace.
The timestamp of each hop is the running sum of `wire_ns + overhead_ns`. The
annotation of the terminal hop carries `drain:Xns`. The bottleneck edge is
marked with `<BN:XXGB/s>` so it can be identified visually.
### D7. Meaning of the case_filter argument
- `None` or `"all"`: run every case (default).
- Any other string: run only the case whose name matches exactly. Example:
`kernbench probe --case h2d-2hop`.
Within each category, a case with `name != case_filter` is skipped, and the
monotonicity / D2H≥H2D comparison of that category is naturally skipped when
only one data point exists.
The `--case` option of the CLI parser defaults to `"all"`, so omitting the
argument runs everything.
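### D8. A fresh GraphEngine per case
The 4 H2D, 4 D2H, and 5 PE DMA cases each run in a **new GraphEngine**
instance (`engine = GraphEngine(graph)`). Reasons:
- Isolate state accumulated across cases (op_log, completion tracking,
allocators, etc.) so it does not cross-talk.
- Guarantee that the traffic of one case does not affect the BW measurement
of another.
This isolation lets probe results be interpreted as the latency of **each
case as a standalone single flow**. Measuring multi-flow contention is the
job of a separate tool (e.g. the `pe2pe_overview` plot, the multi-flow
merging model of ADR-0033).
### D9. Stability of the output format
The stdout output of probe is meant for humans; the exact column widths,
separators, and whitespace are not a machine-readable contract. For
automated tools to parse probe results, a separate JSON output mode would
have to be added (not implemented yet).
The `[v]` / `[x]` prefixes on PASS/FAIL lines are guaranteed stable as
anchors for CI grep.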
## Alternatives Considered
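### A1. Register probe as a separate bench (`@bench(name="probe")`)
Rejected. probe is intended as a verification tool, not a bench: multi-engine
execution for sweeps / analysis and invariant PASS/FAIL output are its
essence, which does not fit the "single device + single RuntimeContext"
bench model of ADR-0045.
### A2. Exit code 1 on a monotonicity violation
Rejected (for now). The tool is intended mainly for human inspection, so it
prints PASS/FAIL lines and exits with 0. If CI should fail on a violation, a
separate wrapper can decide from the result of `grep "\[x\]"`. A strict-mode
flag (`--strict`) can be introduced later.
### A3. Move the probe case definitions to external YAML
Rejected (for now). The 13 cases (4 H2D + 4 D2H + 5 PE DMA) are hard-coded
and their meaning is tightly bound to the topology mesh structure. Moving
them to external YAML would require documenting the meaning of the cube
indices (4, 8, 12 / 1, 4, 15) separately, which loses cohesion. If cases
start being added frequently, introduce it then with a separate ADR.
### A4. Add multi-flow contention measurement
Rejected (out of probe's scope). The single-flow isolation model stated in
D8 is the core intent of probe. Multi-flow contention belongs to another
area of the ADR-0033 latency model and is handled by a separate tool or a
separate case category.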
## Consequences
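- The case catalog (D1) and measurement units (D2/D3) of probe are stated at
the ADR level, so it is clear which table format a new traffic category must
follow.
- The column semantics of formula vs actual (D4) are fixed, so questions
like "why is Drain% 5% / 70%?" can be tied quickly to a check of the
ADR-0033 assumptions.
- The automatic invariant checks (D5) are fixed in the ADR, establishing a
safety net: probe catches monotonicity / D2H≥H2D regressions immediately
when the latency model changes.
- The per-case isolation of D8 is stated, so probe results can safely be
interpreted as single-flow measurements. It is clear that a separate tool
track is needed when multi-flow measurement is required.
- The strict-mode flag of A2 is recorded as a follow-up candidate, so it can
be introduced with minimal extra work when CI integration is requested.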
@@ -0,0 +1,308 @@
# ADR-0050: CCL Algorithm Module Contract — `ccl/algorithms/*.py`
## Status
Accepted (2026-05-22).
This ADR specifies the interface, the kernel signature, and the procedure
for adding a new algorithm that a module under
`src/kernbench/ccl/algorithms/` must satisfy to be used as a collective
algorithm by the AHBM CCL backend (ADR-0047). ADR-0047 D3 mentions in a
single line only that "the algorithm module must expose `kernel`,
`kernel_args`, and an optional `TOPO_NAME_TO_KIND`", but **the contract
algorithm module authors must follow** has never been organized at the ADR
level. It is the counterpart of ADR-0045, which covers the bench module
contract.
## First action (the very first thing done)
An algorithm module is imported at two points:
1. **AHBM backend entry**: when user code calls
`dist.init_process_group(backend="ahbm")`, `self._algo_module =
importlib.import_module(self._merged["module"])` runs inside
`AhbmCCLBackend.__init__`. The first things that happen at module level:
- Integer constants such as `SIP_TOPO_RING/TORUS/MESH` are exposed in the
module namespace.
- The `TOPO_NAME_TO_KIND` dict is exposed in the module namespace; the
backend looks it up with
`topo_map = getattr(self._algo_module, "TOPO_NAME_TO_KIND", None)`.
- The `kernel_args` function is defined, to be used by the caller at call
time.
- The algorithm function, such as `allreduce_intercube_multidevice`, is
defined.
- On the last line of the module, the alias
`kernel = allreduce_intercube_multidevice` is exposed.
2. **ccl.yaml install step**: the same algorithm module is imported when
`kernbench.ccl.install.install_ipcq` is called to push the IPCQ neighbor
table.
In short, **the first job of an algorithm module is "to expose the
topology-kind constants, the `TOPO_NAME_TO_KIND` dict, the `kernel_args`
function, and the `kernel` alias in the module namespace"**. Import-time
side effects suffice for all of these; no separate initialization function
needs to be called.
## Context
At process group initialization, `AhbmCCLBackend` (ADR-0047) obtains the
module path from `defaults.algorithm` in `ccl.yaml` (or from the algorithm
name the user specified) and imports it dynamically. The backend expects the
following four things from the module:
- `kernel`: the entry function of the collective.
- `kernel_args(world_size, n_elem, cube_w=, cube_h=) -> tuple`: the bundle
of positional arguments passed to the kernel.
- `TOPO_NAME_TO_KIND` (optional): a dict mapping the `sips.topology` string
of `topology.yaml` (e.g. `"ring_1d"`, `"torus_2d"`, `"mesh_2d_no_wrap"`) to
an integer constant.
- (indirect) IPCQ neighbor table installation:
`configure_sfr_intercube_multisip` decides the SFR from the
`TOPO_NAME_TO_KIND` of the algorithm module and `cube_w/h`.
The only algorithm module in the current corpus is
`lrab_hierarchical_allreduce.py` (248 lines). The name stands for
"**l**eft-**r**ight **a**lternating **b**roadcast **hierarchical
allreduce**". Whenever modules such as `ring_allreduce`, `tree_allreduce`,
or `broadcast` are added in the future, they must follow this contract for
dispatch to stay consistent.
If this contract is not at the ADR level:
- A new algorithm author has to infer the signature from the single line in
ADR-0047 D3.
- The meaning of the kernel argument order (especially `t_ptr, n_elem,
cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h,
tl`) is not clear without grepping the code.
- What arguments `kernel_args` takes and what tuple it must return is fixed
only by convention.
## Decision
### D1. An algorithm module exposes four public symbols
```python
# src/kernbench/ccl/algorithms/<name>.py
from __future__ import annotations
# (required) topology-kind constants, used inside the algorithm
SIP_TOPO_RING = 0
SIP_TOPO_TORUS = 1
SIP_TOPO_MESH = 2
# (optional) topology name → kind mapping. The backend uses it to convert
# the string SIP topology from ccl.yaml/topology into an integer.
TOPO_NAME_TO_KIND = {
    "ring_1d": SIP_TOPO_RING,
    "torus_2d": SIP_TOPO_TORUS,
    "mesh_2d_no_wrap": SIP_TOPO_MESH,
}
# (required) kernel argument builder
def kernel_args(world_size: int, n_elem: int, *, cube_w: int = 4, cube_h: int = 4) -> tuple:
    return (n_elem, cube_w, cube_h, world_size)
# (required) kernel function (TLContext is injected through the `tl=...` keyword)
def my_allreduce_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
                        sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
    ...
# (required) kernel alias; the backend accesses it as `module.kernel`
kernel = my_allreduce_kernel
```
- The `kernel` alias is the entry point the backend calls directly. Whatever
the function is named (e.g. `allreduce_intercube_multidevice`), it must be
exposed as `module.kernel = fn`.
- Without `kernel_args` the backend has no way to build the algorithm
arguments. See D2 for the signature.
- Without `TOPO_NAME_TO_KIND` the backend falls back to
`sip_topo_kind = 0`. An algorithm that supports only a single topology may
omit it.
### D2. `kernel_args` 시그너처 — `(world_size, n_elem, *, cube_w, cube_h)`
```python
def kernel_args(world_size: int, n_elem: int, *,
cube_w: int = 4, cube_h: int = 4) -> tuple:
return (n_elem, cube_w, cube_h, world_size)
```
- **위치 인자**: `world_size` (= rank 수), `n_elem` (= 단일 shard 의
element 수, f16 기준).
- **키워드 인자**: `cube_w`, `cube_h` (= cube mesh 크기). default 는
4×4 — `topology.yaml` 의 `sip.cube_mesh` 기본값과 정합.
- **반환**: kernel 의 위치 인자 순서대로 묶은 tuple.
backend 의 `all_reduce` 가 호출 시:
```python
kernel_args_tuple = self._algo_module.kernel_args(
    self._world_size, n_elem, cube_w=eff_cube_w, cube_h=eff_cube_h,
)
extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)
pending = self.ctx.launch(
    self._merged["algorithm"], kernel_fn, tensor,
    *kernel_args_tuple, *extra_args, _defer_wait=True,
)
```
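So the kernel's final positional arguments are `(tensor_ptr,
*kernel_args_tuple, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`, and
`tl=...` is injected automatically as a keyword. The length and order of the
tuple returned by `kernel_args` must **match the kernel signature 1:1** for
the positional parameters that sit between `t_ptr` and the four trailing
`sip_*` arguments.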
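The positional layout described above can be checked mechanically. The following is a minimal sketch, assuming a stand-in kernel and a hypothetical helper `check_contract` (neither is part of kernbench). It only uses `inspect.signature` to confirm that `t_ptr`, the `kernel_args` tuple, and the four `sip_*` extras together fill every positional parameter, and that `tl` is keyword-only.

```python
import inspect

def kernel_args(world_size: int, n_elem: int, *, cube_w: int = 4, cube_h: int = 4) -> tuple:
    return (n_elem, cube_w, cube_h, world_size)

def my_allreduce_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
                        sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
    # Stand-in body: echo the positional arguments so the layout is visible.
    return (t_ptr, n_elem, cube_w, cube_h, n_sips,
            sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)

def check_contract(kernel, args_tuple, n_extra=4):
    """Hypothetical helper: True if 1 (t_ptr) + len(args_tuple) + n_extra
    equals the kernel's positional parameter count and `tl` is keyword-only."""
    params = list(inspect.signature(kernel).parameters.values())
    positional = [p for p in params if p.kind is inspect.Parameter.POSITIONAL_OR_KEYWORD]
    kw_only = [p.name for p in params if p.kind is inspect.Parameter.KEYWORD_ONLY]
    return len(positional) == 1 + len(args_tuple) + n_extra and kw_only == ["tl"]

args_tuple = kernel_args(4, 8, cube_w=2, cube_h=2)
extra_args = (1, 0, 0, 0)  # sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h
contract_ok = check_contract(my_allreduce_kernel, args_tuple)
call_result = my_allreduce_kernel(0x1000, *args_tuple, *extra_args, tl=None)
```

A check of this kind could run in a unit test for each new algorithm module, so that a mismatch between `kernel_args` and the kernel signature fails before any launch.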
### D3. `kernel` 함수 시그너처 — 정형화된 9 + tl 인자
권장 시그너처:
```python
def my_kernel(
    t_ptr: int,          # VA base of the row-wise-sharded tensor on this SIP
    n_elem: int,         # element count per cube tile (or per shard)
    cube_w: int,         # cube mesh width (from kernel_args)
    cube_h: int,         # cube mesh height (from kernel_args)
    n_sips: int,         # same as world_size (rank = SIP, ADR-0024)
    sip_rank: int,       # rank of this SIP
    sip_topo_kind: int,  # result of the TOPO_NAME_TO_KIND lookup
    sip_topo_w: int,     # SIP mesh width (0 for ring_1d)
    sip_topo_h: int,     # SIP mesh height (0 for ring_1d)
    *, tl,               # TLContext (auto-injected)
) -> None:
    ...
```
`kernel_args` 가 다른 위치 인자 순서를 채택하더라도, kernel 의 **마지막
4 개 위치 인자는 항상 `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`**
이며 backend 가 `extra_args` 로 append 한다 (ADR-0047 D5). 이 4 개 인자는
사용자 정의 algorithm 도 받아야 하지만, 알고리즘이 single-SIP 이라면
그냥 무시하면 된다.
`tl` 은 위치 인자가 아닌 키워드로 주입된다 — `RuntimeContext.launch` 가
kernel 호출 직전에 `tl=tl_ctx` 를 추가한다. 따라서 kernel signature 의
`tl` 은 keyword-only (`*, tl`) 또는 마지막 키워드 매개변수 형태여야
한다.
### D4. kernel body 의 자유도와 제약
kernel body 안에서 사용 가능한 표면: ADR-0046 D3 의 모든 `tl.*` primitive.
특히 자주 쓰이는 패턴:
- `cube_id = tl.program_id(axis=1)` — 이 PE 가 속한 cube 인덱스.
- `pe_addr = t_ptr + cube_id * nbytes` — cube-별 tile 의 VA 계산.
- `acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")` — local 데이터
로드.
- `tl.send(dir=...)` / `tl.recv(dir=..., shape=, dtype=)` — IPCQ
collective.
- `acc = acc + recv` — TensorHandle 산술 연산자 (ADR-0046 D4).
- `tl.store(pe_addr, acc)` — 결과 저장.
kernel body 는 일반 Python 함수이며, branching/looping 자유. 단:
- SimPy `yield` 또는 `async` 금지 (ADR-0046 D1).
- TensorHandle 의 `.data` 직접 접근 금지 — phase 1 timing 모델은
데이터 의존을 모른다 (ADR-0020 의 2-pass 분리).
- kernel 실행은 deterministic 해야 한다 — 같은 입력으로 두 번 실행하면
같은 op 시퀀스 발사. random / external IO 금지.
### D5. SIP topology semantics — `sip_topo_kind` 의 의미
backend 가 `topology.yaml` 의 `system.sips.topology` 문자열을 algorithm
모듈의 `TOPO_NAME_TO_KIND` 로 lookup 하여 `sip_topo_kind` 정수로 변환.
algorithm 은 이 정수를 보고 분기:
```python
if sip_topo_kind == SIP_TOPO_RING:
    acc = _inter_sip_ring(...)
elif sip_topo_kind == SIP_TOPO_TORUS:
    acc = _inter_sip_torus_2d(...)
elif sip_topo_kind == SIP_TOPO_MESH:
    acc = _inter_sip_mesh_2d(...)
```
각 topology branch 는 IPCQ direction 이름 (예: `"global_E"`, `"W"`, `"S"`,
`"N"`) 을 통해 peer 와 통신. direction 의 의미는 ADR-0023/0025 가 정의
하며, `configure_sfr_intercube_multisip` 가 IPCQ neighbor table 을 그에
맞춰 설치한다.
algorithm 모듈은 자기가 지원하지 않는 topology kind 가 들어오면 silent
no-op 으로 두기보다 명시적으로 `raise ValueError(f"unsupported topology
kind {sip_topo_kind}")` 하는 것을 권장 — 실수로 backend 에 잘못 dispatch
된 경우 빠르게 fail.
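The fail-loud recommendation can be sketched as follows. This is an illustration only: the function name `dispatch_inter_sip` is hypothetical, the returned strings are placeholders for the real inter-SIP phases, and the example algorithm is assumed to support ring and torus but not mesh.

```python
SIP_TOPO_RING = 0
SIP_TOPO_TORUS = 1
SIP_TOPO_MESH = 2

def dispatch_inter_sip(sip_topo_kind: int) -> str:
    """Return the name of the branch taken; raise on an unsupported kind."""
    if sip_topo_kind == SIP_TOPO_RING:
        return "ring"
    if sip_topo_kind == SIP_TOPO_TORUS:
        return "torus_2d"
    # No silent no-op: an unsupported kind is reported immediately.
    raise ValueError(f"unsupported topology kind {sip_topo_kind}")

ring_branch = dispatch_inter_sip(SIP_TOPO_RING)
```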
### D6. ccl.yaml 의 algorithm entry 구조
algorithm 모듈은 `ccl.yaml` 의 entry 와 짝을 이룬다 (ADR-0023 D10 +
ADR-0047 D3):
```yaml
defaults:
  algorithm: lrab_hierarchical_allreduce
  n_elem: 8
algorithms:
  lrab_hierarchical_allreduce:
    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
    # optional: world_size override
    # optional: per-algorithm parameters consumed by configure_sfr_intercube_multisip
```
- `module`: full Python module path. backend 의 `importlib.import_module`
가 이 문자열을 그대로 사용.
- `world_size` (optional): 명시되면 topology fallback 을 override
(ADR-0047 D2).
- algorithm-specific parameters 는 `configure_sfr_intercube_multisip` 가
소비.
새 algorithm 추가 시:
1. `src/kernbench/ccl/algorithms/<name>.py` 작성 (D1 컨벤션).
2. `ccl.yaml` 의 `algorithms` 섹션에 entry 추가.
3. (필요 시) `kernbench.ccl.sfr_config` 에 SFR 설치 분기 추가.
4. test 추가 (예: `tests/sccl/test_<name>.py`, ADR-0043 의 eval harness
확장).
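Before wiring a new module into `ccl.yaml`, the D1 contract can be validated at import time. The sketch below is a hedged illustration: `load_algorithm` is a hypothetical helper, and a `types.ModuleType` object stands in for the result of `importlib.import_module`. The fallback to kind 0 when `TOPO_NAME_TO_KIND` is absent follows D1; raising `KeyError` for a name missing from an existing table is an assumption of this sketch.

```python
import types

def load_algorithm(module, topo_name: str):
    """Hypothetical validator: check the D1 contract on an imported module
    and resolve the SIP topology name to an integer kind."""
    for name in ("kernel", "kernel_args"):
        if not callable(getattr(module, name, None)):
            raise AttributeError(f"algorithm module lacks required symbol {name!r}")
    # TOPO_NAME_TO_KIND is optional; a missing table falls back to kind 0 (D1).
    table = getattr(module, "TOPO_NAME_TO_KIND", None)
    kind = 0 if table is None else table[topo_name]
    return module.kernel, module.kernel_args, kind

# Stand-in for importlib.import_module("kernbench.ccl.algorithms.<name>").
fake = types.ModuleType("fake_algo")
fake.kernel_args = lambda world_size, n_elem, *, cube_w=4, cube_h=4: (
    n_elem, cube_w, cube_h, world_size)
fake.kernel = lambda *args, tl=None: None
kernel_fn, args_fn, kind = load_algorithm(fake, "torus_2d")
```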
### D7. legacy "rank = flat PE index" 모드
The `world_size` override path in `ccl.yaml` described by ADR-0047 D2 is used
by the legacy "rank = flat PE index" tests. In this mode too, an algorithm
module may simply assume that `n_sips=world_size` ranks arrive: the backend
separates the rank↔(SIP, cube, PE) mapping beforehand, so the algorithm body
needs no per-mode branching.
단, single-cube workload 에서는 `cube_w=cube_h=1` 이 들어와 mesh-기반
phase 들이 skip 되도록 작성해야 한다 (`lrab_hierarchical_allreduce.py`
의 `single_cube = (cube_w == 1 and cube_h == 1)` 패턴 참고).
## Alternatives Considered
### A1. algorithm 모듈을 class 로 구조화 (`class Allreduce: kernel(...)` 등)
기각. Python 모듈 namespace 자체가 algorithm 의 identity 로 사용 중이며
(ADR-0047 D3 의 `importlib.import_module`), class 한 겹은 추가 indirection
만 늘리고 dispatch 측 코드를 두텁게 만든다. 모듈-레벨 free function
+ `kernel` alias 패턴이 충분히 명확.
### A2. `kernel_args` 를 명시적 dataclass 로 typing
기각 (현재). algorithm 마다 인자 갯수가 다른 것이 정상이며, dataclass 한
종류를 강제하면 다양한 algorithm 간 호환이 어려워진다. tuple 반환은 simple
하고 backend 측 `*kernel_args_tuple` unpacking 과 깨끗이 맞물린다.
algorithm 별 자체 타입 강도가 필요해지면 그 algorithm 모듈 안에서 NamedTuple
사용은 자유.
### A3. SFR 설치를 algorithm 모듈 안으로
기각. SFR 설치 (`configure_sfr_intercube_multisip`) 는 topology + algorithm
모두를 보고 IPCQ neighbor table 을 설치하는 cross-module 결정이라, algorithm
모듈 내부보다 `kernbench.ccl.sfr_config` 같은 전용 위치가 자연스럽다. D6 의
"필요 시 sfr_config 분기 추가" 워크플로우가 책임 분리 측면에서 더 명확.
### A4. algorithm name 을 모듈 namespace 에 자동 등록 (decorator)
기각. ADR-0045 (bench) 와 달리 algorithm 은 ccl.yaml entry 와 직접 묶여
있어 추가 등록 레지스트리가 중복이다. `module` 문자열 매핑 하나면 충분.
## Consequences
- ADR-0047 D3 의 한 줄 contract 가 D1–D7 의 작성자-친화적 가이드로 확장
되어, 새 algorithm 추가 시 시그너처를 grep 으로 추론할 필요 없음.
- D3 의 9 + tl 인자 시그너처가 표준화되어, backend 의 `extra_args` append
(ADR-0047 D5) 와 자연스럽게 맞물림. 향후 single-SIP-only algorithm 도
4 개의 sip_* 인자를 받아야 함이 명시.
- With the fail-loud recommendation in D5, when the configured SIP topology
is one the algorithm does not support, the run fails with a ValueError raised
by the algorithm module instead of producing a silent wrong result.
- D6 의 단계별 추가 절차가 명시되어, 새 algorithm 추가가 sfr_config /
test / ccl.yaml 어디까지 손대야 하는지 분명.
@@ -0,0 +1,267 @@
# ADR-0051: Routing Helper API — `AddressResolver` + `PathRouter`
## Status
Accepted (2026-05-22).
`policy/routing/router.py` 가 노출하는 두 helper 클래스
(`AddressResolver`, `PathRouter`) 의 모든 public API, 인자, 반환 값,
그리고 네 가지 다른 adjacency graph 의 사용처를 명시한다. ADR-0002 가
routing distance 와 ordering, bypass 규칙을 정의하나, **helper API 표면
자체** 는 ADR-level 에 정리된 적이 없다.
## First action (제일 처음에 하는 일)
### `AddressResolver(graph)`
On construction it immediately caches two things:
1. `self._node_ids = set(graph.nodes)`: the set of all node ids (for lookup).
2. `self._hbm_slice_bytes = hbm_total_gb * (1 << 30) // slices_per_cube`:
derived from `graph.spec.cube.memory_map` (default `48 GB / 8 slices = 6
GB`). `resolve()` uses this value to recover `pe_id` from the `hbm_offset`
of an HBM PA.
In other words, **the first thing AddressResolver does is "precompute the
full set of node ids and the HBM slice size"**. It does not keep the graph
itself.
### `PathRouter(graph)`
생성 즉시 **네 개의 별도 adjacency graph 를 동시 구축**한다:
1. `self._adj_all`: 모든 edge 포함 (component-to-component routing 용).
2. `self._adj`: `kind != "command"` 인 edge 만 (PE DMA / 일반 data path).
3. `self._adj_mcpu_dma`: `_MCPU_DMA_EXCLUDE = {"pe_internal",
"pe_to_router"}` 를 제외 (M_CPU DMA 가 PE pipeline 노드로 잘못 라우팅
되지 않게).
4. `self._adj_local`: `_UCIE_KINDS` 8 종을 제외 (cube-local routing 용 —
UCIe 가 zero-distance bus 처럼 보여 Dijkstra 가 mesh 보다 선호하는
것을 막음).
각 그래프는 `defaultdict(list)` of `(neighbor, weight)` 형태이며,
`edge.routing_weight_mm or edge.distance_mm` 이 weight 로 쓰인다.
즉, **PathRouter 의 첫 일은 "topology edge 들을 4개의 다른 정책으로 동시
분류하여 4 개의 인접 리스트로 구축하는 것"**. 매 `find_*()` 호출 시 적절
한 그래프를 골라 Dijkstra 를 돌린다.
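The four-way classification can be sketched in a single pass over the edges. This is a minimal illustration, not the code in `router.py`: the `Edge` dataclass is a simplified stand-in, edges are assumed to be directed with `src`/`dst` fields, and each filter is applied exactly as listed above (whether the real implementation combines filters is not stated here).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"}
UCIE_KINDS = {"ucie_internal", "ucie_conn_to_router", "router_to_ucie_conn",
              "ucie_conn_to_noc", "noc_to_ucie_conn", "ucie_mesh",
              "io_to_cube", "cube_to_io"}

@dataclass
class Edge:  # simplified stand-in for the topology edge type
    src: str
    dst: str
    kind: str
    distance_mm: float
    routing_weight_mm: Optional[float] = None

def build_adjacency(edges):
    adj_all, adj, adj_mcpu_dma, adj_local = (defaultdict(list) for _ in range(4))
    for e in edges:
        # An explicit routing weight overrides the physical distance.
        weight = e.routing_weight_mm or e.distance_mm
        entry = (e.dst, weight)
        adj_all[e.src].append(entry)
        if e.kind != "command":
            adj[e.src].append(entry)
        if e.kind not in MCPU_DMA_EXCLUDE:
            adj_mcpu_dma[e.src].append(entry)
        if e.kind not in UCIE_KINDS:
            adj_local[e.src].append(entry)
    return adj_all, adj, adj_mcpu_dma, adj_local

edges = [
    Edge("a", "b", "router_mesh", 2.0),
    Edge("a", "c", "command", 1.0),
    Edge("a", "d", "ucie_mesh", 5.0, routing_weight_mm=9.0),
    Edge("a", "e", "pe_to_router", 0.5),
]
adj_all, adj, adj_mcpu_dma, adj_local = build_adjacency(edges)
```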
## Context
`policy/routing/router.py` 는 다음 두 책임을 함께 수행한다:
- **이름 매핑**: 토폴로지 명명 규칙 (`sip{S}.cube{C}.<comp>`,
`sip{S}.io{I}.pcie_ep` 등) 의 단일 소유자. 컴포넌트 / probe / IPCQ
install / runtime API 가 이름 문자열을 직접 만들지 않고 helper 를 호출.
- **경로 결정**: edge 의 `kind` 에 따른 정책 분리. 같은 src→dst 라도
routing 의도 (PE DMA vs M_CPU DMA vs general component routing) 에 따라
다른 adjacency 를 사용해야 결과가 달라진다.
이 helper API 가 코드 전반에서 광범위하게 소비되는데도 (probe.py /
distributed.py / install.py / 각종 component / tests), ADR-level 에서
**정확한 시그너처 / 반환 의미 / 어떤 adjacency 를 쓰는지** 가 한 곳에
정리되어 있지 않다. 본 ADR 이 그 빈자리를 채운다.
## Decision
### D1. `AddressResolver` 의 5 개 public API
#### D1.1. `resolve(addr: PhysAddr) -> str`
`PhysAddr` 인스턴스를 토폴로지의 destination node id 로 변환.
```
addr.kind == "hbm"          → f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
    where pe_id = addr.hbm_offset // self._hbm_slice_bytes   (ADR-0017 D4/D9)
addr.kind == "pe_resource":
    addr.unit_type == PE    → f"sip{s}.cube{d}.pe{addr.pe_id}.pe_tcm"
    addr.unit_type == SRAM  → f"sip{s}.cube{d}.sram"
    addr.unit_type == MCPU  → f"sip{s}.cube{d}.m_cpu"
    otherwise               → RoutingError("unsupported unit_type")
any other kind              → RoutingError("unsupported address kind")
```
산출된 node id 가 `self._node_ids` 에 없으면 `RoutingError(f"node {node_id}
not found in topology")`. 즉, address 의 syntax 가 valid 해도 topology 에
실제로 매핑되는 노드가 없으면 fail-loud.
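The HBM branch of `resolve()` reduces to one integer division followed by a membership check. The sketch below assumes the default `48 GB / 8 slices` configuration mentioned earlier; `resolve_hbm` and the local `RoutingError` class are illustrative stand-ins, not the real API.

```python
HBM_TOTAL_GB = 48
SLICES_PER_CUBE = 8
HBM_SLICE_BYTES = HBM_TOTAL_GB * (1 << 30) // SLICES_PER_CUBE  # 6 GiB per slice

class RoutingError(Exception):
    """Stand-in for the router's error type."""

def resolve_hbm(sip: int, cube: int, hbm_offset: int, node_ids: set) -> str:
    # The slice index doubles as the owning PE id.
    pe_id = hbm_offset // HBM_SLICE_BYTES
    node_id = f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe_id}"
    if node_id not in node_ids:
        # Syntactically valid address, but no such node in the topology.
        raise RoutingError(f"node {node_id} not found in topology")
    return node_id

nodes = {f"sip0.cube0.hbm_ctrl.pe{i}" for i in range(SLICES_PER_CUBE)}
first = resolve_hbm(0, 0, 0, nodes)
third = resolve_hbm(0, 0, 2 * HBM_SLICE_BYTES + 4096, nodes)
```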
#### D1.2. `find_m_cpu(sip, cube) -> str`
`f"sip{sip}.cube{cube}.m_cpu"`. 없으면 `RoutingError`.
#### D1.3. `find_pcie_ep(sip, io_id="io0") -> str`
`f"sip{sip}.{io_id}.pcie_ep"`. 없으면 `RoutingError`.
#### D1.4. `find_io_cpu(sip, io_id="io0") -> str`
`f"sip{sip}.{io_id}.io_cpu"`. 없으면 `RoutingError`.
#### D1.5. `find_all_pcie_eps() -> list[str]`
전 SIP 의 PCIE_EP node id 를 정렬된 리스트로 반환. `endswith(".pcie_ep")`
필터링. cross-SIP IPCQ 가 모든 PCIE_EP 를 enumerate 할 때 사용.
명명 규칙 (`sip{S}.cube{C}.<comp>`, `sip{S}.{io_id}.<comp>`) 의 단일
소유자가 이 클래스다 (ADR-0015 D4). 토폴로지 빌더가 같은 명명 규칙으로
노드를 만들고, 컴포넌트는 이름 문자열을 절대 직접 구성하지 않는다 —
모두 helper 를 거친다.
### D2. `PathRouter` 의 4 개 adjacency graph
생성자가 한 번에 구축. edge `kind` 가 정책을 결정:
| graph | 제외 edge kinds | 용도 |
|-------------------|-----------------------------------------------|--------------------------------------------|
| `_adj_all` | (none) | M_CPU↔NOC command 포함, IO_CPU/M_CPU routing |
| `_adj` | `"command"` | PE DMA / 일반 data path |
| `_adj_mcpu_dma` | `"pe_internal"`, `"pe_to_router"` | M_CPU DMA (PE pipeline 우회) |
| `_adj_local` | `_UCIE_KINDS` (`ucie_internal`, `ucie_conn_to_router`, `router_to_ucie_conn`, `ucie_conn_to_noc`, `noc_to_ucie_conn`, `ucie_mesh`, `io_to_cube`, `cube_to_io`) | same-cube routing (UCIe bus 우회) |
각 그래프는 `dict[node_id, list[(neighbor, weight)]]` 이며, weight 는
`edge.routing_weight_mm or edge.distance_mm`. command edge 의 routing
영향력을 명시적으로 가르고, UCIe 의 "0-distance bus" 가 mesh 보다 선호
되는 것을 막기 위한 `_adj_local` 분리가 ADR-0017 D7 의 cross-PE-slice
mesh-distance 요구와 정합.
### D3. The 5 public `PathRouter` APIs (+ 2 backward-compat)
#### D3.1. `find_path(src_pe: str, dst_node: str) -> list[str]`
**PE DMA routing**. `src_pe` is a PE prefix (e.g. `"sip0.cube0.pe0"`); the
function automatically appends `.pe_dma`, so the actual start node becomes
`"sip0.cube0.pe0.pe_dma"`.
cube-local 여부 (`_same_cube`) 에 따라 adjacency 선택:
- **same-cube** (src 와 dst 가 `sip{S}.cube{C}.` prefix 공유):
`_adj_local` 사용. UCIe 우회를 막아 cross-PE-slice 가 mesh 거리를 정확
히 지불 (ADR-0017 D7).
- **cross-cube**: `_adj` 사용. UCIe 가 자연스럽게 cross-cube path 의
최적 선택지로 포함됨.
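The same-cube test amounts to comparing the `sip{S}.cube{C}.` prefix of both node ids. A minimal sketch follows, assuming a regex-based check; `same_cube` and `pick_adjacency` are illustrative names, and the real `_same_cube` may be implemented differently.

```python
import re

_CUBE_PREFIX = re.compile(r"^(sip\d+\.cube\d+)\.")

def same_cube(a: str, b: str) -> bool:
    """True when both node ids share the same sip{S}.cube{C}. prefix."""
    ma, mb = _CUBE_PREFIX.match(a), _CUBE_PREFIX.match(b)
    return bool(ma and mb and ma.group(1) == mb.group(1))

def pick_adjacency(src_pe: str, dst_node: str) -> str:
    start = f"{src_pe}.pe_dma"  # the start node is the PE prefix plus ".pe_dma"
    return "_adj_local" if same_cube(start, dst_node) else "_adj"

local_choice = pick_adjacency("sip0.cube0.pe0", "sip0.cube0.hbm_ctrl.pe3")
cross_choice = pick_adjacency("sip0.cube0.pe0", "sip0.cube1.hbm_ctrl.pe0")
```

Matching the full digit run before the dot keeps `cube1` and `cube10` distinct, which a plain `startswith` on a shorter prefix would not.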
#### D3.2. `find_path_with_distance(src_pe, dst_node) -> tuple[list[str], float]`
D3.1 과 동일한 adjacency 정책을 사용하나, 결과로 `(path, total_distance)`
를 함께 반환. probe / 분석 도구에서 distance 메트릭이 필요할 때 사용.
#### D3.3. `find_mcpu_dma_path(m_cpu_id: str, dst_hbm_id: str) -> list[str]`
**M_CPU DMA path**. cube 가 같으면 `_adj_local` (mesh 안에서 마무리), 다르
면 `_adj_all` (UCIe 경유). `_MCPU_DMA_EXCLUDE` 가 PE pipeline 노드를 자동
배제하므로, M_CPU 가 PE 의 내부 stage 를 거쳐 routing 되는 잘못된 경로가
나오지 않는다.
#### D3.4. `find_memory_path(src: str, dst: str) -> list[str]`
`pcie_ep → io_noc → cube → router mesh → hbm_ctrl` 같은 직접 메모리
경로. `_adj_mcpu_dma` 를 사용하여 `pe_internal` 및 `pe_to_router` edge
를 제외 — host-issued read/write 가 PE pipeline 으로 새지 않게 보장.
probe (ADR-0049 D1 의 H2D/D2H case) 에서 직접 호출.
#### D3.5. `find_node_path(src: str, dst: str) -> list[str]`
임의의 두 node 사이의 path. **command edge 포함** (`_adj_all` 사용). M_CPU
↔ NOC 같은 command-kind link 를 거쳐야 하는 IoCpuComponent /
MCpuComponent 등이 호출.
#### D3.6. backward-compat shims
- `_dijkstra(start, goal) -> list[str]` — `_run_dijkstra(self._adj, …)`
의 thin wrapper.
- `_dijkstra_with_dist(start, goal) -> tuple[list[str], float]` — distance
포함 버전.
언더스코어 prefix 에서 보듯이 내부 API 인 척이지만 기존 테스트가 직접
호출. 새 코드는 D3.1–D3.5 를 사용하고, 이 두 shim 은 deprecation 후보.
### D4. Dijkstra 알고리즘 — single-source shortest path
`_run_dijkstra_with_dist(adj, start, goal)`:
- `heapq` priority queue.
- `best: dict[node, distance]` — 노드별 최단 거리 캐시.
- `prev: dict[node, predecessor]` — path reconstruction.
- weight 는 `routing_weight_mm or distance_mm`. UCIe 처럼 routing_weight 가
명시되어 distance 와 다른 edge 가 있으므로 weight 분리가 의도된 것.
`start == goal` 은 빠른 path `([start], 0.0)` 반환. 도달 불가는
`RoutingError(f"no path from {start} to {goal}")`.
이 알고리즘은 **deterministic** 하다 — 같은 graph + start/goal 이면 같은
경로. 이는 SPEC R1 의 "Routing MUST be deterministic" 요구와 정합. tie-
break 는 `heapq` 의 push 순서를 따른다 (Python list 순서가 deterministic).
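The structure described above (`heapq` queue, `best`, `prev`, fast path for `start == goal`, error on unreachable goal) can be sketched as below. This is a generic Dijkstra, not a copy of `_run_dijkstra_with_dist`; in particular, this sketch pushes `(distance, node)` tuples, so equal-distance entries are ordered by node id, which may differ from the tie-break of the real code.

```python
import heapq

class RoutingError(Exception):
    """Stand-in for the router's error type."""

def run_dijkstra_with_dist(adj, start, goal):
    if start == goal:
        return [start], 0.0          # fast path
    best = {start: 0.0}              # shortest known distance per node
    prev = {}                        # predecessor map for path reconstruction
    heap = [(0.0, start)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == goal:
            break
        if dist > best.get(node, float("inf")):
            continue                 # stale heap entry
        for neighbor, weight in adj.get(node, []):
            cand = dist + weight
            if cand < best.get(neighbor, float("inf")):
                best[neighbor] = cand
                prev[neighbor] = node
                heapq.heappush(heap, (cand, neighbor))
    if goal not in best:
        raise RoutingError(f"no path from {start} to {goal}")
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1], best[goal]

adj = {"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 1.5)], "c": []}
path, total = run_dijkstra_with_dist(adj, "a", "c")
```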
### D5. helper API 의 단일 소유자 원칙
다음 정보는 오직 router.py 안에서만 결정된다:
- 명명 규칙: `sip{S}.cube{C}.<comp>`, `sip{S}.{io_id}.<comp>`,
`sip{S}.cube{C}.hbm_ctrl.pe{pe_id}`.
- adjacency 정책: 어떤 edge kind 가 어떤 그래프에 포함되는가.
- HBM slice 크기로부터 PE id 복원 방법.
- Dijkstra의 weight 결정 (`routing_weight_mm or distance_mm`).
이 단일 소유자 원칙이 깨지면 (예: 컴포넌트가 자체적으로 `f"sip{s}..."` 를
구성하기 시작하면) 명명 규칙 변경 시 영향 범위가 폭발한다. ADR-0015 D4 의
정신과 정렬.
### D6. helper API consumer 의 목록
본 helper 가 노출하는 메소드를 호출하는 곳을 명시 (현재 코퍼스 기준):
- `probes/probe.py` (ADR-0049): `find_pcie_ep`, `find_io_cpu`,
`find_m_cpu`, `find_node_path`, `find_mcpu_dma_path`,
`find_memory_path`, `find_path`, `resolve`.
- `runtime_api/distributed.py` (ADR-0047): 간접 (engine 내부 routing).
- `ccl/install.py` (ADR-0023): `find_all_pcie_eps`, `resolve`.
- `sim_engine/event_log.py`: probe 와 유사하게 `find_pcie_ep`,
`find_memory_path`.
- `components/builtin/m_cpu.py`, `components/builtin/io_cpu.py`:
`find_node_path`, `find_mcpu_dma_path`.
- Various tests (test_routing.py, test_cross_sip_routing.py, etc.): most of
D3.1–D3.5.
새 consumer 가 추가될 때 본 ADR 의 D1/D3 가 그 의도에 맞는 메소드가
이미 있는지 / 새 메소드를 추가해야 하는지 1차 판단의 기준이 된다.
## Alternatives Considered
### A1. 단일 adjacency graph + edge-kind filter 동적 적용
기각. 매 `find_*()` 마다 graph filtering 을 다시 하면 Dijkstra 의 cache
locality 와 성능이 떨어진다. 4 개 그래프 동시 구축 (D2) 은 메모리 비용
이 작고 (edge ≤ 수만 건 규모), 호출 시점에 정책 선택이 O(1) 로 결정.
### A2. adjacency 분리를 edge 의 `kind` 가 아닌 별도 metadata 로
기각. edge `kind` 는 이미 topology builder 가 부여하며 (ADR-0015 D4 +
ADR-0017), 별도 metadata 를 도입하면 두 시스템이 동기화되어야 하는
중복이 생긴다.
### A3. Dijkstra 대신 BFS + uniform weight
기각. routing_weight_mm 이 edge 별로 다른 (mesh link / UCIe / IO-internal)
현실에서 BFS 는 hop 수 최소화일 뿐 latency / distance 최단을 보장하지
않는다. SPEC R1 + R2 의 결정적·정확한 routing 요구에 어긋남.
### A4. helper API 를 클래스 메서드가 아닌 모듈 함수로
기각. 두 클래스 (`AddressResolver`, `PathRouter`) 가 각각 cache 상태
(`_node_ids`, `_hbm_slice_bytes`, 4 adjacency graphs) 를 보유해야 하며,
같은 graph 인스턴스에 여러 routing 질의가 발생한다. 모듈 함수는 매 호출
시 state 를 다시 만들거나 global 로 두어야 해서 안전성/성능 저하.
## Consequences
- 컴포넌트 / probe / IPCQ install / runtime API 가 모두 router.py 의
helper 만 호출하면 명명 규칙 변경 (예: `.io0.` → `.iochiplet0.`) 이
단 한 파일 수정으로 끝남 (D5).
- D2 의 4 그래프 분리가 ADR 에 굳어져, 새 edge kind 가 추가될 때 (예:
Inter-die UCIe link 의 새 kind) 어느 그래프에 포함시킬지 결정의 명확
한 기준 제공.
- D3.1 의 cube-local vs cross-cube 분기 (ADR-0017 D7) 가 명시되어, 향후
routing 동작을 변경하려는 사람이 어느 adjacency 를 건드려야 할지 안다.
- D6 의 consumer 목록이 명시되어, helper API 변경 시 PR review 범위가
분명. backward-compat shim (D3.6) 의 deprecation 후보가 식별됨.
@@ -0,0 +1,352 @@
# ADR-0052: OpLog + MemoryStore Schemas — sim_engine internals
## Status
Accepted (2026-05-22).
This ADR specifies the `OpRecord` schema in `sim_engine/op_log.py`, the
record_start / record_end / record_copy behavior of `OpLogger`, and the
(space, addr) address-space namespace and read/write semantics used by
`MemoryStore` in `sim_engine/memory_store.py`. ADR-0020 (2-pass data
execution) declares that both pieces of infrastructure exist, but **the exact
fields of the records and their meaning** were never written down at the ADR
level, even though several ADRs, such as ADR-0046 D3.2 (`tl.store`
visibility) and ADR-0023 D9 (IPCQ copy record), depend on their behavior.
## First action (제일 처음에 하는 일)
### `OpLogger(memory_store=None)`
On construction it initializes three fields:
1. `self._records: list[OpRecord] = []`: the accumulated op records.
2. `self._pending: dict[int, dict] = {}`: partial records keyed by `id(msg)`
(created at record_start, with record_end not yet received).
3. `self._memory_store = memory_store`: an optional MemoryStore reference,
used to capture input snapshots for math ops and HBM source snapshots for
dma_write.
생성 시점에는 records / pending 모두 비어 있으며, `record_*` 호출이
순차적으로 데이터를 누적한다.
### `MemoryStore()`
생성 즉시 `self._storage: dict[str, dict[int, np.ndarray]] = {}` 단 하나
의 필드 초기화. 두 단계 dict (`space → addr → ndarray`) 이며 lazy 하게
필요한 space 가 생길 때마다 inner dict 가 채워진다.
즉, **두 인프라의 첫 일은 "비어 있는 누적 buffer + space-별 sparse dict
를 만들어 두는 것"** 이다. 첫 record / write 가 실제로 도착하면 그때
필드가 채워지기 시작한다.
## Context
ADR-0020 (2-pass data execution) 의 D2/D5/D7 가 다음을 선언:
- Phase 1 (timing) 동안 `ComponentBase._on_process_start/end` hook 이
`OpLogger.record_start/end` 를 호출하여 모든 data op 의 시간 + 메타
데이터를 기록.
- Phase 2 (data) 가 op_log 를 t_start 순으로 재생하여 실 데이터 결과를
계산.
- 데이터 페이로드 자체는 `MemoryStore` 에 (space, addr) 키로 보관.
ADR-0023 D9 (IPCQ atomic write), ADR-0027 (Megatron TP scratch
overwrite 회피), ADR-0046 D3.2 (`tl.store` visibility) 등 후속 ADR 들이
op_log 와 MemoryStore 의 동작에 의존하지만, **정확한 record 필드 / space
이름 / 스냅샷 시점** 은 코드 grep 으로만 확인 가능하다. 본 ADR 이 이를
정리한다.
## Decision
### D1. `OpRecord` 스키마 — 7 개 필드
```python
@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component_id: str
    op_kind: str   # "memory" | "gemm" | "math" | "unknown"
    op_name: str   # e.g. "dma_read", "gemm_f16", "exp",
                   # "TileToken/DMA_READ", "composite_gemm",
                   # "ipcq_copy"
    params: dict[str, Any]
    dependency_ids: list[int] = field(default_factory=list)
```
- **`t_start` / `t_end`**: SimPy 시간 (float ns). `t_start` 는 component
가 op 를 시작한 시점, `t_end` 는 완료 시점. duration = `t_end - t_start`.
- **`component_id`**: op 가 발생한 node id (예:
`"sip0.cube0.pe0.pe_dma"`).
- **`op_kind`**: 4 가지 중 하나. Phase 2 DataExecutor 가 이 값으로 분기.
- **`op_name`**: 디버깅 / 분석용 사람-친화 이름. TileToken 일 경우
`"TileToken/{stage_type}"` (예: `"TileToken/DMA_READ"`) 로 stage 를
구분.
- **`params`**: op-종속 메타데이터 dict (D3 참고).
- **`dependency_ids`**: 현재 사용되지 않음 (default `[]`). 향후 cross-op
dependency 추적이 필요해질 때를 위한 자리.
### D2. `OpLogger.records` — t_start 정렬 보장
```python
@property
def records(self) -> list[OpRecord]:
    self._records.sort(key=lambda r: r.t_start)
    return self._records
```
매 접근 시 `t_start` 로 stable sort. 즉 같은 t_start 인 record 들은 insertion
순서를 유지. ADR-0020 D5 의 "t_start stable ordering" 요구와 정합.
Phase 2 DataExecutor 는 항상 `records` property 를 통해 접근하므로,
record_end 호출이 t_start 와 다른 순서로 도착해도 (예: 짧은 op 가 긴
op 보다 늦게 시작했으나 먼저 끝남) 재정렬되어 일관된 시퀀스를 받는다.
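The effect of the stable sort can be seen with a trimmed-down logger. This is a sketch only: `Rec` and `MiniLogger` are hypothetical stand-ins for `OpRecord` and `OpLogger`, keeping just the fields needed to show the ordering.

```python
from dataclasses import dataclass

@dataclass
class Rec:  # trimmed stand-in for OpRecord
    t_start: float
    t_end: float
    op_name: str

class MiniLogger:
    def __init__(self):
        self._records = []

    def record_end(self, rec):
        self._records.append(rec)  # appended in completion order

    @property
    def records(self):
        # list.sort is stable: equal t_start keeps insertion order.
        self._records.sort(key=lambda r: r.t_start)
        return self._records

log = MiniLogger()
log.record_end(Rec(5.0, 6.0, "short_late_start"))   # finishes first
log.record_end(Rec(0.0, 9.0, "long_early_start"))
log.record_end(Rec(5.0, 9.5, "same_start_second"))
order = [r.op_name for r in log.records]
```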
### D3. op_name 별 `params` 스키마 (`_extract_op_info` 매핑)
#### D3.1. `op_kind="memory", op_name="dma_read"` (DmaReadCmd)
```python
{"src_addr": int, "nbytes": int, "handle_id": str}
```
#### D3.2. `op_kind="memory", op_name="dma_write"` (DmaWriteCmd)
```python
{
    "src_space": str,     # handle.space ("tcm"|"hbm"|"sram"), default "tcm"
    "src_addr": int,      # handle.addr
    "shape": tuple, "dtype": str,
    "dst_space": "hbm",   # DmaWrite always targets HBM
    "dst_addr": int,
    "nbytes": int,
    "handle_id": str,
    # at record_end, a snapshot is added when src_space == "hbm" (D4)
    "snapshot": np.ndarray | None,
}
```
#### D3.3. `op_kind="gemm", op_name=f"gemm_{dtype_a}"` (GemmCmd)
```python
{
    "src_a_addr": int, "src_b_addr": int, "dst_addr": int,
    "shape_a": tuple, "shape_b": tuple, "shape_out": tuple,
    "dtype_in": str, "dtype_out": str,
    "m": int, "k": int, "n": int,
    # ADR-0027: per-operand and output spaces are preserved
    "src_a_space": str, "src_b_space": str, "dst_space": str,
}
```
#### D3.4. `op_kind="math", op_name=msg.op` (MathCmd; op = "exp", "sum", "add", "where" 등)
```python
{
    "input_addrs": list[int],       # addr of each input handle
    "input_shapes": list[tuple],
    "input_spaces": list[str],
    "input_dtypes": list[str],
    "dst_addr": int, "dst_space": str,
    "shape_out": tuple, "dtype": str,
    "axis": int | None,             # meaningful only for reductions
    # at record_end, a snapshot of every input is filled in (D4)
    "input_snapshots": list[np.ndarray | None],
}
```
#### D3.5. `op_kind="gemm" or "math", op_name=f"composite_{op}"` (CompositeCmd)
```python
{
    "op": str,                       # "gemm" | "math"
    "out_addr": int, "out_nbytes": int,
    # when op == "gemm", the same fields as GemmCmd are added:
    "src_a_addr": int, "src_b_addr": int,
    "shape_a": tuple, "shape_b": tuple,
    "dtype_in": str, "dtype_out": str,
    "src_a_space": str, "src_b_space": str,
    "dst_space": "hbm", "dst_addr": int,   # = out_addr
}
```
`op == "gemm"` 이면 `op_kind = "gemm"`, 아니면 `"math"`. Phase 2 측에서
GemmCmd 와 동일 path 로 재생되도록 alias.
#### D3.6. `op_kind="memory", op_name="ipcq_copy"` (record_copy 전용 경로)
```python
{
    "src_space": str, "src_addr": int,
    "dst_space": str, "dst_addr": int,
    "shape": tuple, "dtype": str, "nbytes": int,
    "snapshot": np.ndarray | None,   # passed by the caller; otherwise record_copy does a fresh read
}
```
`PE_DMA._handle_ipcq_inbound` (ADR-0023 D9) 가 이 record 를 발사하여 IPCQ
slot 의 inbound copy 를 Phase 2 가 재생 가능하게 한다. 이 record 는
`record_start` / `record_end` 를 거치지 않고 직접 `record_copy()` 로 push.
#### D3.7. `op_kind="unknown", op_name=type(msg).__name__`
`_extract_op_info` 가 인식 못 한 message 의 fallback. params = `{}`.
DataExecutor 가 이 op_kind 를 만나면 skip — Phase 2 replay 에 영향 없음.
### D4. snapshot 캡처 시점
`OpLogger._memory_store` 가 set 되어 있을 때 record_end 가 다음을 수행:
- **math op**: calls `self._memory_store.read(...)` with the
addr/shape/space/dtype of every input and attaches the ndarray copies to
`params["input_snapshots"]`. An entry is None when the read fails.
- **dma_write op**: `src_space == "hbm"` 인 경우에만 source HBM 의
스냅샷을 `params["snapshot"]` 에 첨부. TCM source 는 **명시적으로
스킵** — TCM (PE scratch) 은 Phase 2 math/gemm 재생이 다시 채우므로,
Phase-1-time snapshot 을 잡으면 이전 kernel 의 stale 데이터를 잡을 위험
(ADR-0027 postmortem: TP gemm → all_reduce race).
- **ipcq_copy**: `record_copy` 호출자가 `snapshot=token.data` 같이 in-flight
스냅샷을 전달. 없으면 record_copy 가 fresh read 로 대체 시도.
스냅샷은 `.copy()` 가 호출되어 (`ndarray.copy()` 가 fresh allocation) 이후
storage mutation 으로부터 안전. ADR-0027 의 "cross-PE Phase 2 ordering"
race 회피의 근간.
`memory_store` 가 None 인 경우 (Phase 1 timing-only 모드) 스냅샷 단계는
전부 skip. record 의 timing 정보만 보존되며 데이터 replay 는 불가능.
### D5. TileToken 처리 — record_start 가 stage 정보를 캡처
ADR-0014 D6 의 self-routing tile token (pipeline 모드) 은 stage_idx 가
record_end 시점에 이미 advance 되어 있을 수 있다 (TileToken 이 다음
component 로 이동하면서 next stage 의 params 를 캐시). 따라서:
`record_start` 가 다음을 `pending[id(msg)]["snap"]` 에 미리 저장:
```python
snap["stage_type"] = stage.stage_type.name # "DMA_READ", "GEMM", 등
snap["stage_params"] = dict(stage.params) # 시점의 params 복사본
```
`record_end` takes this snap and merges it into params:
- `params["stage_type"]` is added to the final params.
- The keys of `stage_params` are merged (keys already present are preserved).
- If `op_name == "TileToken"`, it is rewritten to
`op_name = f"TileToken/{stage_type}"` (e.g. `"TileToken/DMA_READ"`) to
disambiguate records of different stages emitted by the same component.
이 메커니즘 덕분에 DMA_READ vs DMA_WRITE, FETCH vs STORE 가 같은 component
(예: pe_dma) 에서 발생하더라도 reporting 측에서 구분 가능.
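The start-time capture can be sketched with a plain dict standing in for a TileToken. Everything here is illustrative: `MiniOpLogger`, the dict layout of the token, and the tuple used as a record are assumptions made to keep the example self-contained; the real code reads `stage.stage_type.name` from a stage object.

```python
class MiniOpLogger:
    def __init__(self):
        self.records = []
        self._pending = {}

    def record_start(self, msg, t_start):
        stage = msg["stages"][msg["stage_idx"]]
        # Capture stage info now; stage_idx may have advanced by record_end.
        snap = {"stage_type": stage["stage_type"],
                "stage_params": dict(stage["params"])}
        self._pending[id(msg)] = {"t_start": t_start, "snap": snap}

    def record_end(self, msg, t_end, op_name="TileToken"):
        entry = self._pending.pop(id(msg))
        snap = entry["snap"]
        params = {"stage_type": snap["stage_type"]}
        for key, value in snap["stage_params"].items():
            params.setdefault(key, value)  # keys already present are kept
        if op_name == "TileToken":
            op_name = f"TileToken/{snap['stage_type']}"
        self.records.append((entry["t_start"], t_end, op_name, params))

token = {"stage_idx": 0, "stages": [
    {"stage_type": "DMA_READ", "params": {"nbytes": 256}},
    {"stage_type": "GEMM", "params": {"m": 4}},
]}
logger = MiniOpLogger()
logger.record_start(token, 0.0)
token["stage_idx"] = 1   # the token advances before record_end runs
logger.record_end(token, 3.0)
```

Even though the token already points at the GEMM stage when `record_end` runs, the record still describes the DMA_READ stage that was active at `record_start`.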
### D6. `MemoryStore` — (space, addr) 두 단계 dict
```python
class MemoryStore:
    def __init__(self) -> None:
        self._storage: dict[str, dict[int, np.ndarray]] = {}
    # setdefault creates the inner dict lazily for a new space
    def write(self, space, addr, data): self._storage.setdefault(space, {})[addr] = data
    def read(self, space, addr, shape=None, dtype=None) -> np.ndarray: ...
    def has(self, space, addr) -> bool: ...
    def snapshot(self) -> MemoryStore: ...
```
#### D6.1. space namespace
문자열 키. 표준 값:
- `"hbm"`: HBM 데이터 (deploy_tensor + Phase 2 dma_write 결과).
- `"tcm"`: PE-로컬 TCM (Phase 2 math/gemm 결과).
- `"sram"`: cube-level SRAM (ADR-0023 D9.7 IPCQ slot tier).
다른 space (예: `"reg"`) 도 자유롭게 허용 — `_storage` 가 lazy dict 라
새 space 가 write 호출과 함께 자동 생성.
#### D6.2. address keying
`addr` 는 정수. **physical address (PA) 또는 virtual address (VA)** 일 수
있다 — MemoryStore 자체는 address space 의 의미를 모르고 그저 키로 쓴다.
Phase 1 의 `MemoryWriteMsg` 는 PA + VA 둘 다 write (`_create_tensor` 에서
PA 로 zero-init, VA base 로도 zero-init), Phase 2 는 op_log 가 captured
한 address 로 read/write.
`addr` 의 의미는 호출자가 결정한다 — `MemoryStore` 는 lookup 만 제공.
#### D6.3. read/write 의미 — reference store (no copy)
`write(space, addr, data)`: `data` ndarray 의 reference 를 저장. **copy
하지 않음**. 호출자가 같은 ndarray 를 이후 mutate 하면 stored value 도
변경된다.
`read(space, addr, shape=None, dtype=None)`: 저장된 ndarray 의 reference
반환. `shape` 또는 `dtype` 이 제공되면:
- `dtype != stored.dtype`: `arr.view(np_dtype)` 로 reinterpret cast (no
copy).
- `shape != stored.shape`: `nbytes` 가 일치하면 `arr.reshape(shape)` (view).
- `nbytes` 불일치: `ValueError`.
데이터를 안전하게 분리하려면 호출자가 `arr.copy()` 호출. ADR-0027 의
race 회피가 op_log snapshot 단계에서 명시적 copy 를 강제하는 이유.
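The reference-store semantics can be demonstrated with a small model. This is a sketch under assumptions: `MiniStore` is not the real `MemoryStore`, dtypes are passed as NumPy dtype names rather than the simulator's own dtype strings, and error handling is reduced to the nbytes check.

```python
import numpy as np

class MiniStore:
    def __init__(self):
        self._storage = {}

    def write(self, space, addr, data):
        self._storage.setdefault(space, {})[addr] = data  # reference, no copy

    def read(self, space, addr, shape=None, dtype=None):
        arr = self._storage[space][addr]
        if dtype is not None and np.dtype(dtype) != arr.dtype:
            arr = arr.view(np.dtype(dtype))        # reinterpret cast, no copy
        if shape is not None and tuple(shape) != arr.shape:
            if int(np.prod(shape)) * arr.itemsize != arr.nbytes:
                raise ValueError("nbytes mismatch")
            arr = arr.reshape(shape)               # a view for contiguous data
        return arr

store = MiniStore()
buf = np.zeros(4, dtype=np.float16)
store.write("tcm", 0x100, buf)
snapshot = store.read("tcm", 0x100).copy()   # explicit copy isolates the data
buf[0] = 1.0                                 # caller mutates after the write
live = store.read("tcm", 0x100, shape=(2, 2))
```

After the mutation, `live` reflects the new value while `snapshot` keeps the old one, which is the behavior the op_log snapshot step relies on.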
#### D6.4. `has(space, addr) -> bool`
해당 키의 존재 여부만 확인. 데이터 인스턴스화는 안 함.
#### D6.5. `snapshot() -> MemoryStore`
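Shallow copy. A new instance of each inner dict is created, but the ndarray
references are shared. It is used at Phase 2 initialization to fork the
Phase 1 store, so that Phase 2 writes that bind a new array to a (space,
addr) key do not affect other users of the Phase 1 store. In-place mutation
of a shared ndarray remains visible through both stores (D6.3).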
### D7. op_log 가 SimPy 단일-스레드를 가정한다
The `_records` and `_pending` fields of `OpLogger` are used without locks.
This relies on SimPy being single-threaded, so no other thread can interleave
between `record_start` and `record_end`.
향후 multi-process kernbench (ADR-0047 D6) 가 도입되면 OpLogger 도 process
별로 분리되어야 함이 명시. 단일 OpLogger 인스턴스가 multiple process 의
record 를 받지 못한다.
## Alternatives Considered
### A1. op_log 를 SQLite / parquet 같은 외부 store 로
기각 (현재). in-memory list 가 Phase 1 → Phase 2 의 핸드오프 latency 를
최소화한다. 외부화는 long-running batch run 에서 의미가 있겠으나, 현재
single-run 워크로드 에서는 overhead 만 추가.
### A2. snapshot 을 record_start 시점에 캡처
기각. record_start 시점은 input 이 아직 채워지지 않은 상황 (예: math
op 의 input 이 직전 op 의 output 일 때) 이 흔하다. record_end 가 정확한
시점.
### A3. MemoryStore 를 component-별 store 로 분리
기각. (space, addr) 키가 이미 충분히 disambiguation 을 제공하며, component
별 분리는 cross-PE IPCQ copy (ADR-0023 D9) 가 source/destination 양쪽
store 를 접근해야 하는 케이스를 복잡하게 만든다.
### A4. op_log 에 cross-op dependency edge 명시
부분 채택. `dependency_ids` 필드가 OpRecord 에 자리 잡고 있지만 현재
사용되지 않음 (D1). Phase 2 DataExecutor 가 t_start 정렬 + secondary sort
(memory ops before math at same t_start) 로 ordering 을 결정하며, 명시적
dependency graph 가 필요해지면 이 필드가 채워질 자리. 현재는 ordering rule
이 충분하므로 미사용.
## Consequences
- ADR-0020's declaration of op_log / MemoryStore is expanded into the
concrete schemas of D1–D6, so the exact meaning of each field can be checked
in the ADR, without grep, when writing or modifying the Phase 2 DataExecutor.
- D3 의 op_name 별 params 스키마가 명시되어, 새 op (예: 새 reduction
type) 추가 시 `_extract_op_info` 분기 어디에 끼울지 명확.
- D4 의 snapshot 시점 차이 (math = input snapshot, dma_write = HBM-only
snapshot) 가 ADR 에 굳어져, ADR-0027 의 cross-PE race 회피 결정이 향후
refactor 에서 silently 깨지지 않음.
- D6.3 의 reference-store 의미가 명시되어, 호출자가 mutation safety 책임
을 인지. ADR-0027 의 explicit `.copy()` 패턴이 정당화됨.
- D7 의 single-thread 가정이 명시되어, multi-process kernbench (ADR-0047
D6 supersession 후보) 도입 시 OpLogger 분리가 필요함이 분명.
@@ -0,0 +1,307 @@
# ADR-0053: Topology Builder + Visualizer Algorithms
## Status
Accepted (2026-05-22).
This ADR specifies the key algorithmic choices (placement-driven router
attachment, mesh auto-layout, source_hash cache, view projection, SVG
rendering) of the topology compile and visualization pipeline that
`topology/builder.py`, `topology/mesh_gen.py`, and `topology/visualizer.py`
carry out together. ADR-0006 defines the high-level intent of topology
compilation (compiled topology, distance extraction, automatic diagram
generation), but **which algorithms the builder actually uses** could only be
confirmed by grepping the code.
## First action (제일 처음에 하는 일)
`resolve_topology(path_str)` 가 호출되면 다음 4 단계가 순서대로 일어난다:
1. **경로 검증** (`builder.py::resolve_topology`):
`Path(path_str).expanduser().resolve()`, 존재 확인, file 여부 확인.
실패 시 `FileNotFoundError` 또는 `ValueError`.
2. **YAML 파싱** (`_read_spec`): `yaml.safe_load`. parse error 면 line/
column 정보 포함한 `ValueError`. dict 가 아니면 reject.
3. **mesh 자동 생성** (`mesh_gen.ensure_mesh_file`): topology yaml 과
같은 디렉터리에 `cube_mesh.yaml` 을 만들거나 (캐시 invalid 시) 재사용
(캐시 hit 시). 이 단계가 cube NoC 의 라우터 grid 와 부착 정보를 결정.
4. **graph 컴파일** (`_compile_graph`): system → IO chiplets → cubes →
inter-cube edges → IO↔cube edges → system↔IO edges 순으로 nodes/edges
를 누적, 그 다음 4 개의 view projection (system, sip, cube, pe) 을
생성하여 `TopologyGraph` 로 묶음.
즉, **topology compile 의 첫 일은 "topology.yaml 을 dict 로 읽고, 동일
디렉터리에 cube_mesh.yaml 을 생성/검증한 뒤, system→sip→cube→pe 순으로
flat graph + 4-view projection 을 만드는 것"** 이다.
## Context
`topology/` 패키지의 책임:
- **builder.py** (1207 줄): topology.yaml 을 받아 `TopologyGraph` (nodes
+ edges + 4 view projections) 를 컴파일.
- **mesh_gen.py** (305 줄): cube NoC 의 라우터 grid 와 PE/UCIe/M_CPU/SRAM
부착 위치를 자동 결정하여 `cube_mesh.yaml` 로 캐시.
- **visualizer.py** (887 줄): `TopologyGraph` 로부터 SVG 다이어그램 4종
(system / sip / cube / pe) 을 생성.
ADR-0006 가 "topology compilation 의 결과는 distance metadata 와 diagram
generation 의 single source" 라는 high-level 결정을 정의하나, 구체 알고리즘
(예: placement-driven nearest-router attachment, HBM 제외 zone 산출,
source_hash 의 어떤 필드가 invalidation 을 트리거하는가) 은 ADR 에 없다.
특히 다음 결정들이 ADR-level 에 부재:
- 왜 mesh_gen 이 별도 파일 (`cube_mesh.yaml`) 로 캐시되는가?
- source_hash 가 어떤 필드를 포함하며, 어떤 변경이 재생성을 강제하는가?
- placement coordinate 가 cube 좌표가 아닌 mm 단위인 이유?
- HBM zone 제외와 UCIe N/S/E/W 분배가 mesh 안에서 어떻게 결정되는가?
- view projection 4 개 (system/sip/cube/pe) 의 추상화 레벨 차이?
이 ADR 이 이 결정들을 한 곳에 정리한다.
## Decision
### D1. compile 파이프라인 — 6 단계
`_compile_graph(spec)`:
1. **시스템 노드 생성** (`_instantiate_system`): `fabric.switch0`, host CPU
등 system-level 노드 추가.
2. **per-SIP loop** (`for sip_id in range(system.sips.count)`):
- **IO chiplets** (`_instantiate_io_chiplets`): pcie_ep / io_cpu /
io_noc / io_ucie PHY / conn 노드 + 내부 양방향 edge 생성.
- **cube instantiation** (`_instantiate_cube`): cube_mesh.yaml 의 router
grid 를 토대로 cube-별 라우터, PE sub-components (pe_cpu, pe_dma,
pe_fetch_store, pe_gemm, pe_math, pe_mmu, pe_tcm, pe_scheduler,
pe_ipcq), m_cpu, sram, hbm_ctrl 인스턴스화 + 내부 edge 깔기.
- **inter-cube edges** (`_add_inter_cube_edges`): UCIe N/S/E/W mesh
edge.
- **IO ↔ cube edges** (`_add_io_to_cube_edges`): io_noc 와 cube 의
edge UCIe phy 사이 연결.
3. **switch ↔ IO edges** (`_add_system_to_io_edges`): `fabric.switch0`
와 각 SIP 의 `pcie_ep` 사이 양방향 edge (ADR-0038 D3 + ADR-0010 의
cross-SIP IPCQ 경로).
4. **view projections** 4 종 build:
- `_build_system_view(spec)` — Tray 레벨, SIP 들과 system switch.
- `_build_sip_view(spec)` — SIP 안의 cube mesh + IO chiplet.
- `_build_cube_view(spec)` — 단일 cube 안의 router grid + PE/M_CPU/SRAM/
HBM_CTRL 부착.
- `_build_pe_view(spec)` — 단일 PE 안의 9 sub-components + 내부 edge.
5. **TopologyGraph 리턴**: `TopologyGraph(spec, nodes, edges, system_view,
sip_view, cube_view, pe_view)`.
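The six instantiation steps (system nodes, IO chiplets, cubes, inter-cube
edges, IO ↔ cube edges, switch ↔ IO edges) are **order-sensitive**:
inter-cube edges have a valid src/dst only after the cubes exist, and IO
chiplets must be created first so that IO ↔ cube edges can reference them.
When adding a new node type, insert it at a position consistent with these
dependencies.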
### D2. `cube_mesh.yaml` — 별도 파일 + source_hash 캐시
`mesh_gen.ensure_mesh_file(cube_spec, mesh_path)`:
1. `source_hash = _compute_source_hash(cube_spec)` 산출. 입력 필드:
- `geometry` (cube_mm.w/h 등).
- `pe_layout` (corners, pe_per_corner).
- `ucie.n_connections`.
- `memory_map.hbm_mapping_mode`.
- `placement` (m_cpu/sram pos_mm).
2. `mesh_path` (= `topology.yaml` 와 같은 디렉터리의 `cube_mesh.yaml`) 이
존재하고 `existing.source_hash == source_hash` 면 재사용 (캐시 hit).
3. 아니면 `_generate_mesh(cube_spec, source_hash)` 로 새 mesh 생성 후
yaml 로 저장.
별도 파일로 캐시하는 이유:
- mesh 생성은 PE/UCIe/router 부착 계산이 들어가 매번 다시 하기 무거움.
- 같은 cube spec 으로 여러 번 실행 시 동일 mesh 가 보장되어야 함.
- 사람이 직접 mesh 를 inspect / debug 할 수 있는 artifact 가 됨.
`source_hash` 가 list 한 5 개 필드가 mesh 형상을 결정하는 핵심이며, 그
외 (예: bandwidth, overhead_ns) 변경은 mesh 재생성을 트리거하지 않는다.
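The cache-invalidation rule can be sketched as a hash over only the five mesh-shaping fields. The choice of SHA-256 over a sorted-key JSON dump is an assumption of this sketch, as are the function names and the example spec values; the real `_compute_source_hash` may serialize differently.

```python
import hashlib
import json

def compute_source_hash(cube_spec: dict) -> str:
    """Hash only the mesh-shaping fields; everything else is ignored."""
    subset = {
        "geometry": cube_spec.get("geometry"),
        "pe_layout": cube_spec.get("pe_layout"),
        "ucie_n_connections": cube_spec.get("ucie", {}).get("n_connections"),
        "hbm_mapping_mode": cube_spec.get("memory_map", {}).get("hbm_mapping_mode"),
        "placement": cube_spec.get("placement"),
    }
    blob = json.dumps(subset, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def cache_hit(existing_mesh, cube_spec) -> bool:
    """Reuse the cached mesh only when its stored hash matches the spec."""
    return (existing_mesh is not None
            and existing_mesh.get("source_hash") == compute_source_hash(cube_spec))

spec = {
    "geometry": {"cube_mm": {"w": 12.0, "h": 12.0}},   # illustrative values
    "pe_layout": {"corners": ["NW", "NE", "SW", "SE"], "pe_per_corner": 2},
    "ucie": {"n_connections": 2},
    "memory_map": {"hbm_mapping_mode": "per_pe"},
    "placement": {"m_cpu": {"pos_mm": [1.5, 5.5]}, "sram": {"pos_mm": [1.5, 8.5]}},
    "bandwidth_gbps": 100,
}
base_hash = compute_source_hash(spec)
same_mesh = compute_source_hash(dict(spec, bandwidth_gbps=999)) == base_hash
new_mesh = compute_source_hash(dict(spec, ucie={"n_connections": 4})) != base_hash
```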
### D3. cube NoC mesh auto-layout 알고리즘
`_generate_mesh(cube_spec, source_hash)`:
#### D3.1. 행/열 결정
- `pe_positions = _corner_pe_positions(cube_w, cube_h)`: 4 corner (NW/NE/
SW/SE) 마다 PE center 좌표 (mm). hardcoded `(1.5, 1.5)` / `(cube_w-1.5,
cube_h-1.5)` 패턴 + `pe_per_corner=2` 면 각 corner 에 2 PE 위치.
- `col_xs = _compute_col_positions(...)`: PE 들의 x 좌표 union + `max_spacing
= 3.0 mm` 보다 큰 gap 에 relay 컬럼 삽입.
- `row_ys, rows_per_half = _compute_row_positions(cube_h, n_connections,
pe_positions)`:
- `n_conn = max(n_connections, 2)` (hot path minimum).
- `rows_per_half = ceil(n_conn / 2)`.
- top 절반 + HBM 두 row + bottom 절반. HBM 은 `(cube_h/2 - 1.5, cube_h/2
+ 1.5)` 에 위치. PE rows 와 HBM rows 사이 `hbm_gap = 1.5 mm`.
#### D3.2. HBM 제외 zone
`hbm_row_start = rows_per_half`, `hbm_row_end = rows_per_half + 1`.
`hbm_col_start = n_cols // 2 - 1`, `hbm_col_end = n_cols // 2`.
이 (row, col) 사각형 안의 router 슬롯은 `None` 으로 마킹 (라우터 없음).
실제 HBM 컨트롤러는 별도 `hbm_ctrl.pe{X}` 노드로 ADR-0017 D9 의 per-PE
파티션 패턴을 따라 부착.
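The row count and the HBM exclusion rectangle follow directly from `n_connections` and `n_cols`. The sketch below uses the formulas from D3.1 and D3.2; the total row count `2 * rows_per_half + 2` is inferred from the "top half + two HBM rows + bottom half" layout, and the function names and `r{r}c{c}` slot labels are illustrative.

```python
import math

def hbm_exclusion_zone(n_connections: int, n_cols: int):
    n_conn = max(n_connections, 2)
    rows_per_half = math.ceil(n_conn / 2)
    hbm_rows = (rows_per_half, rows_per_half + 1)   # hbm_row_start, hbm_row_end
    hbm_cols = (n_cols // 2 - 1, n_cols // 2)       # hbm_col_start, hbm_col_end
    n_rows = 2 * rows_per_half + 2                  # top half + 2 HBM rows + bottom half
    return n_rows, hbm_rows, hbm_cols

def router_grid(n_connections: int, n_cols: int):
    n_rows, (r0, r1), (c0, c1) = hbm_exclusion_zone(n_connections, n_cols)
    # Slots inside the HBM rectangle carry no router and are marked None.
    return [[None if (r0 <= r <= r1 and c0 <= c <= c1) else f"r{r}c{c}"
             for c in range(n_cols)] for r in range(n_rows)]

grid = router_grid(n_connections=4, n_cols=6)
empty_slots = sum(slot is None for row in grid for slot in row)
```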
#### D3.3. PE attachment
The PEs of each corner map to the following rows:
- Top half: NW → row 0, NE → row 1 (index within top_corners).
- Bottom half: SW → row `hbm_row_end + 1`, SE → row `hbm_row_end + 2`.
Each PE is attached to the router in the column nearest its x coordinate
(`min(range(n_cols), key=lambda c: abs(col_xs[c] - pe_x))`). Three entries
are attached: `pe{pe_idx}.dma`, `pe{pe_idx}.cpu`, and `pe{pe_idx}.hbm` (each
pushed onto the per-router attach list).
#### D3.4. M_CPU / SRAM attachment: nearest router by Euclidean distance
For the coordinates `placement.m_cpu.pos_mm` (default `[1.5, 5.5]`) and
`placement.sram.pos_mm` (default `[1.5, 8.5]`), the nearest router is found
by Euclidean distance and `"m_cpu"` / `"sram"` is added to its attach list.
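Both attachment rules (nearest column for PEs, nearest router for M_CPU / SRAM) reduce to a `min` over candidates. The sketch below assumes routers are held in a dict keyed by `(row, col)` with millimetre coordinates as values, and that HBM-zone slots are simply absent from it; that container shape is an assumption, not the real mesh structure.

```python
import math


def nearest_col(col_xs, pe_x):
    """D3.3: index of the column whose x position is closest to the PE."""
    return min(range(len(col_xs)), key=lambda c: abs(col_xs[c] - pe_x))


def nearest_router(routers, pos_mm):
    """D3.4: (row, col) of the router closest to pos_mm by Euclidean distance.

    `routers` maps (row, col) -> (x_mm, y_mm); slots marked None in the grid
    are expected to be left out of the mapping.
    """
    return min(routers, key=lambda rc: math.dist(routers[rc], pos_mm))


col_xs = [1.5, 4.5, 7.5, 10.5]
routers = {(0, 0): (1.5, 1.5), (1, 0): (1.5, 4.5), (2, 0): (1.5, 7.5)}

print(nearest_col(col_xs, pe_x=4.0))
print(nearest_router(routers, [1.5, 5.5]))   # default m_cpu position
print(nearest_router(routers, [1.5, 8.5]))   # default sram position
```

Ties resolve to the first candidate in iteration order, because `min` keeps the earliest minimum; whether the real implementation breaks ties the same way is not stated in this ADR.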
#### D3.5. UCIe N/S/E/W distribution
`ucie_pe_rows = top_pe_rows + bot_pe_rows` (`2 * rows_per_half` rows in
total).
- UCIe-E: `ucie_e.c{i}` on the router in the rightmost col of each PE row.
- UCIe-W: `ucie_w.c{i}` on the router in the leftmost col (mirror of E).
- UCIe-N/S: the PE columns are split half left, half right, and attached at
  the corresponding cols of the top row / bottom row.
Each UCIe connection carries a `c{i}` index, so that ucie_n_connections PHYs
are spread out (ADR-0017 D5+).
### D4. Node naming convention: single owner
builder.py creates nodes with the following naming convention (the
single-owner principle of ADR-0051 D5):
- `fabric.switch0`: system-level switch.
- `sip{S}.{io_id}.{pcie_ep|io_cpu|io_noc|io_ucie.{dir}|conn.{id}}`: IO
  chiplet.
- `sip{S}.cube{C}.{m_cpu|sram|hbm_ctrl.pe{X}|noc.r{R}c{C}|...}`: inside a cube.
- `sip{S}.cube{C}.pe{P}.{pe_cpu|pe_dma|pe_fetch_store|pe_gemm|pe_math|pe_mmu|pe_tcm|pe_scheduler|pe_ipcq}`: PE sub-components.
To change this naming convention, both builder.py and the helpers in
router.py (ADR-0051) must be updated together. Components do not know the
naming convention directly; they only call the helpers.
### D5. Edge `kind` classification
The `kind` assigned to each edge is the input to the routing policy
(ADR-0051 D2). Main kind values:
- `"pe_internal"`: between sub-components inside a PE.
- `"pe_to_router"`: PE_DMA ↔ cube NoC router.
- `"router_mesh"`: between cube NoC routers.
- `"router_to_hbm"`, `"router_to_mcpu"`, `"router_to_sram"`,
  `"sram_to_router"`, etc.: between cube-attached components.
- `"ucie_internal"`, `"ucie_conn_to_router"`, `"router_to_ucie_conn"`,
  `"ucie_conn_to_noc"`, `"noc_to_ucie_conn"`, `"ucie_mesh"`: UCIe-related.
- `"io_internal"`: inside the IO chiplet.
- `"io_to_cube"`, `"cube_to_io"`: IO ↔ cube boundary.
- `"pcie"`: switch ↔ pcie_ep.
- `"command"`: control-plane only edges (M_CPU ↔ NOC etc.; excluded from the
  PE DMA path).
When a new edge kind is added, decide which category of router.py's 4
adjacency graphs (ADR-0051 D2) it belongs to; otherwise it is included only
in `_adj_all` by default, and routing may differ from what was intended.
### D6. View projection: 4 abstraction levels
In addition to the flat form (nodes + edges), `TopologyGraph` holds 4 view
projections:
- **system_view** (`_build_system_view`): Tray level. SIP boxes +
  `fabric.switch0`. Shows PCIE links. High-level overview for external
  presentations.
- **sip_view** (`_build_sip_view`): inside one SIP. Cube mesh + IO chiplet
  (pcie_ep + io_cpu + io_noc). UCIe N/S/E/W appear as inter-cube connections.
- **cube_view** (`_build_cube_view`): inside one cube. Router grid +
  PE/M_CPU/SRAM/HBM_CTRL attachments + the UCIe PHY edge portion. For
  diagnosing intra-cube routing / placement.
- **pe_view** (`_build_pe_view`): inside one PE. 9 sub-components + internal
  edges (pe_internal kind). For detailed review of PE-internal dataflow.
Views are emitted selectively through the spec, e.g.
`visualization.emit_views: [system, sip, cube]` (ADR-0006). The pe view is
omitted from the default output, but its code is kept (for detailed
debugging).
### D7. visualizer.py: SVG diagram output
`emit_diagrams(graph, out_dir)` renders every view to SVG. Key functions:
- `_render_view_svg(view)`: generic view rendering (when there is no router
  grid).
- `_render_cube_view_svg(view, spec)`: cube view only (draws the HBM block,
  router grid layout, PE/M_CPU/SRAM/HBM positioning).
- `_draw_node`, `_draw_edge`: visual representation of nodes / edges.
- `_pick_scale`, `_compute_node_sizes`: automatic scaling.
The visualizer is classified as a **derived artifact** (ADR-0006) and is not
subject to production checks on code changes. This is consistent with the
"Derived Artifacts" item in CLAUDE.md.
### D8. Impact scope of spec changes
| Spec field                            | Impact                    | Mesh regeneration |
|---------------------------------------|---------------------------|-------------------|
| `system.sips.count`                   | SIP count, node count     | No                |
| `sip.cube_mesh.w/h`                   | cube mesh shape           | No                |
| `cube.geometry.cube_mm.w/h`           | cube size (mm)            | **Yes**           |
| `cube.pe_layout.corners/pe_per_corner`| PE attachment positions   | **Yes**           |
| `cube.ucie.n_connections`             | UCIe PHY distribution     | **Yes**           |
| `cube.memory_map.hbm_mapping_mode`    | HBM mapping mode          | **Yes**           |
| `cube.placement`                      | M_CPU/SRAM positions      | **Yes**           |
| `cube.memory_map.*` (except above)    | HBM capacity / BW         | No                |
| `*.links.*.bw_gbs`                    | edge bandwidth            | No                |
| `*.attrs.overhead_ns`                 | component latency         | No                |
The table matches the `_compute_source_hash` inputs in D2. Any change that
requires mesh regeneration automatically invalidates the source_hash stored
in `cube_mesh.yaml`.
## Alternatives Considered
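### A1. Regenerate the mesh on every compile, with no separate cache file
Rejected. Every case that is invoked repeatedly with the same spec (CLI run,
probe, test) would pay the mesh generation cost again. The artifact that
lets a person inspect the mesh would also disappear.
### A2. Merge mesh generation into builder.py
Rejected (for now). It is a self-contained 305-line algorithm, and the mesh
layout decisions (placement-driven router attachment, HBM exclusion zone)
differ from the builder's general node/edge creation responsibility. Keeping
them separate fits the single-responsibility principle better.
### A3. Express placement coordinates in cube coordinates (col/row)
Rejected. Millimetre coordinates are used consistently on both the
visualization side (visualizer) and the mesh layout side (nearest-router
computation). Cube coordinates are undefined until the router grid has been
determined, so they are unsuitable as placement input.
### A4. Generate view projections lazily
Rejected (for now). Generating the 4 views is cheap (typically < 100 ms), and
eager generation guarantees a single source of truth through
`TopologyGraph`.
### A5. Also support visualizer output formats other than SVG (PNG/PDF)
Rejected. SVG is the most flexible format: vector, text-searchable, and
directly renderable in a browser. If PNG conversion is needed, post-process
with a separate tool (rsvg-convert etc.).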
## Consequences
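- The high-level intent of ADR-0006 is made concrete in D1-D7, so the impact
  of a topology change can be gauged quickly from the D8 table.
- The mesh auto-layout algorithm of D3 is fixed at ADR level, so when a new
  PE attachment pattern (e.g. a 6-zone HBM split) is introduced later, it is
  clear which step is affected.
- The edge kind list of D5 and the view structure of D7 are spelled out, so
  a PR reviewer can see at a glance how far a new component type reaches
  (builder + router + visualizer).
- The source_hash invalidation rule of D2 is spelled out, making it clear
  that a `cube_mesh.yaml` left stale (e.g. when only a bw value changed) is
  normal behavior.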
@@ -0,0 +1,138 @@
# ADR-0054: Milestone evaluation bench (self-contained sweep + figure bench)
## Status
Accepted (2026-05-22).
Amends ADR-0044 (D1/D2) and ADR-0045 (D5), and replaces the ADR-0043/0044
arrangement in which "the logic lives in `scripts/` + `tests/`": the
GEMM/allreduce evaluation harness now becomes a self-contained **bench** that
the user runs to regenerate every result + figure.
## Context
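ADR-0043 (allreduce evaluation) and ADR-0044 (GEMM evaluation) split each
harness into a **sweep** (a manual `scripts/` driver, or for allreduce the
parametrized tests themselves) + **figure tests** that render the committed
data. The sweep/render logic therefore lived in `scripts/gemm_sweep.py`,
`tests/gemm/_gemm_plot_helpers.py`, and `tests/sccl/_allreduce_helpers.py`.
The milestone requirement ("refactor the allreduce + GEMM evaluation so that
the user runs *one bench* to produce all results and plots") cannot be met
with that arrangement: a bench is production code and **cannot import
`tests/`** (ADR-0007 layer direction). The evaluation logic had to move into
production so that the bench can reach it.
The chosen home is the bench module itself, not a separate `kernbench.eval`
package. A bench file may contain arbitrary module-level code, and folding
the harness into the bench keeps one file per domain and removes one package
layer.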
## Decision
### D1. The two milestone benches own the evaluation logic
- `src/kernbench/benches/milestone_1h_gemm.py`: GEMM shape×variant sweep
  + three figure renderers (moved from `scripts/gemm_sweep.py` +
  `tests/gemm/_gemm_plot_helpers.py`).
- `src/kernbench/benches/milestone_1h_ccl.py`: distributed allreduce driver,
  latency + buffer-kind sweep, topology diagram, FSIM comparison, and the
  direct-launch parity reference (moved from
  `tests/sccl/_allreduce_helpers.py`).
Each file is the **single home** of its domain's evaluation logic.
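### D2. The "evaluation bench" pattern (extends ADR-0045 D5)
ADR-0045 D5 fixed a bench to a single configuration (single-SIP, or the
ADR-0024 multi-SIP CCL exception). This ADR adds a third pattern:
- An **evaluation bench** may drive *multiple* configurations and render
  figures. Instead of using the outer `run_bench` engine, it builds its own
  `GraphEngine` / `RuntimeContext` at each sweep point.
- No handle is then submitted to the outer ctx, so the bench finishes by
  submitting a **sentinel tensor** (`torch.zeros((1, 1), …)`) to satisfy the
  `run_bench` "submit at least once" contract (ADR-0045 D4) and let the CLI
  exit with 0.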
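The control flow of an evaluation bench can be sketched with a stand-in context. `FakeCtx` and `run_point` below are hypothetical placeholders: the real `RuntimeContext` and per-point `GraphEngine` construction are not reproduced here, only the "sweep privately, then submit one sentinel" shape.

```python
class FakeCtx:
    """Hypothetical stand-in for the outer RuntimeContext; records submissions."""

    def __init__(self):
        self._submitted = []

    def zeros(self, shape):
        # The real API submits device work; here we only record the call.
        self._submitted.append(("zeros", shape))
        return shape

    def handles(self):
        return list(self._submitted)


def run_eval_bench(torch, sweep_points, run_point):
    """Drive every sweep point privately, then submit one sentinel tensor.

    `run_point` stands for "build a fresh engine/context for this point and
    return its measurement"; none of that work touches the outer `torch`.
    """
    results = [run_point(point) for point in sweep_points]
    torch.zeros((1, 1))  # sentinel: keeps the outer run from being NO_REQUESTS
    return results


ctx = FakeCtx()
print(run_eval_bench(ctx, [64, 128, 256], run_point=lambda n: n * 2))
print(len(ctx.handles()))
```

The point of the sketch is that the outer context sees exactly one submission regardless of how many sweep points ran.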
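### D3. Output location
Both benches write to `src/kernbench/benches/1H_milestone_output/{gemm,ccl}/`
(user request: artifacts next to the bench). The directory holds only
generated PNG/CSV/JSON files (no `.py`/`__init__.py`), so the eager-import
audit (ADR-0045 First action) ignores it: `pkgutil.iter_modules` does not
yield non-package subdirectories. Like the `docs/diagrams/` artifacts, the
outputs are **committed** (so that the figures can be viewed on the remote);
re-running the bench regenerates them in place.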
### D4. Heavy GEMM sweep: fresh by default, reuse with `MILESTONE_FAST`
`milestone-1h-gemm` runs the full 24-sim sweep by default (minutes; one shape
has 2048 tiles). `MILESTONE_FAST=1` reuses the committed
`docs/diagrams/gemm_sweep.json` and only renders (seconds). This reverses
ADR-0044 D1/D2's "keep the heavy sweep as a manual/`slow` step": running the
bench is the regeneration. The slow path is exercised by a
`@pytest.mark.slow` bench test, while the fast path is exercised by default.
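The switch between the two paths can be sketched as below. Only the `MILESTONE_FAST=1` convention comes from this ADR; the helper names `sweep_mode`, `gemm_results`, and `load_committed` are hypothetical, and the injected `env` mapping exists purely to make the sketch testable without touching the process environment.

```python
import os


def sweep_mode(env=None) -> str:
    """Return 'reuse' when MILESTONE_FAST=1 is set, otherwise 'fresh'."""
    env = os.environ if env is None else env
    return "reuse" if env.get("MILESTONE_FAST") == "1" else "fresh"


def gemm_results(run_sweep, load_committed, env=None):
    """Pick the data source for rendering.

    run_sweep:      callable that performs the full (slow) sweep.
    load_committed: callable that reads the committed sweep JSON (fast).
    """
    if sweep_mode(env) == "reuse":
        return load_committed()
    return run_sweep()


print(sweep_mode({}))
print(sweep_mode({"MILESTONE_FAST": "1"}))
```

Treating only the exact value `"1"` as the fast switch is a choice made for this sketch; the ADR does not say how other values are handled.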
### D5. Tests + scripts reuse via thin re-export shims (single home preserved)
The existing figure tests and the `scripts/gemm_sweep.py` entry point are
kept and now reuse the bench modules:
- `tests/gemm/_gemm_plot_helpers.py` → re-exports the renderers +
  `GEMM_SWEEP_JSON`/`GEMM_PLOTS_DIR`/`ROOT` from
  `kernbench.benches.milestone_1h_gemm`.
- `tests/sccl/_allreduce_helpers.py` → re-exports the driver core, config
  writer, sweep constants, renderers, and disk aggregator from
  `kernbench.benches.milestone_1h_ccl`, and keeps the **pytest-only** pieces
  local: the `pytest.param` matrices (`CONFIGS` / `_sweep_params` /
  `_bk_params`) and the fixture-coupled `_run_distributed` wrapper
  (`monkeypatch.chdir` + `_drive_distributed`).
- `scripts/gemm_sweep.py` → thin wrapper over the bench's `run_sweep`.
Tests are allowed to import the bench modules (tests sit above production,
ADR-0007); doing so triggers the whole-package eager audit, which already
runs on every `kernbench` invocation. matplotlib stays a lazy import inside
the renderers, so the audit's startup cost is unchanged.
### D6. Flat module naming (no `benches/` subfolder)
A `benches/` subpackage named `1H_milestone…` is impossible: a Python package
name cannot start with a digit. The benches are therefore the flat modules
`milestone_1h_gemm.py` / `milestone_1h_ccl.py`, and the bench names are
`milestone-1h-gemm` / `milestone-1h-ccl` (kebab-case, starting with a letter
per ADR-0045 D1).
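The naming constraint can be checked directly with the standard library. This is a small sketch; the helper name `importable_by_statement` is made up for the example, and it tests the rule for names used in an `import` statement (identifiers that are not keywords).

```python
import keyword


def importable_by_statement(name: str) -> bool:
    """True if `name` is usable as a module/package name in an import statement."""
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ("1H_milestone_output", "milestone_1h_gemm", "milestone_1h_ccl"):
    print(f"{candidate}: {importable_by_statement(candidate)}")
```

The digit-leading directory name fails the identifier check, while both flat module names pass.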
## Consequences
### Positive
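- `kernbench run --bench milestone-1h-gemm` (or `…-ccl`) regenerates all of a
  domain's results + figures with one command, which is the milestone
  requirement.
- A single source for the evaluation logic (the bench), reused by tests and
  scripts through shims; no duplication.
- The figure tests and `scripts/gemm_sweep.py` keep working unchanged.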
### Negative / limitations
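- Both bench files are large (the CCL one mixes the distributed driver,
  sweeps, and matplotlib drawing). A "bench" that is mostly evaluation
  harness is unusual, and this ADR justifies it.
- Generated artifacts live inside the source tree (`src/kernbench/benches/`)
  by explicit request and are committed (so that the figures can be viewed on
  the remote); they are regenerated when the bench is re-run.
- `milestone-1h-ccl` (and the default `milestone-1h-gemm`) take minutes:
  acceptable for an on-demand milestone artifact, not for everyday runs.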
## Dependencies
- **ADR-0007**: layer direction (why tests may import production but a bench
  may not import tests).
- **ADR-0043 / ADR-0044**: the allreduce / GEMM evaluation harnesses that
  this ADR moves into benches.
- **ADR-0045**: bench module contract; D2 here extends its D5 (single-device
  rule) with the evaluation-bench pattern and relies on D4 (NO_REQUESTS) for
  the sentinel.
- **ADR-0024**: the rank = SIP launcher that the allreduce sweep drives.
## Open questions
- Should the GEMM theoretical model constants (ADR-0044 D5) be sourced from
  ADR-0033/0014 instead of being copied? Unchanged by this ADR.
- Should `build_overview_slides.py` consume the milestone output PNGs instead
  of drawing the GEMM bars natively? Still open (ADR-0044 D6 / Negative).
@@ -0,0 +1,175 @@
# ADR Index
Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **47**.
Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.
## Design Principles
- [ADR-0013](./ADR-0013-ver-verification-strategy.md) — Verification strategy and Phase 1 test plan
- [ADR-0033](./ADR-0033-lat-latency-model-assumptions.md) — Latency model: assumptions and known simplifications
## High-level Architecture
- [ADR-0003](./ADR-0003-dev-target-system-hierarchy.md) — Target system hierarchy and modeling scope _(System hierarchy (Tray / SIP / CUBE / PE))_
- [ADR-0007](./ADR-0007-api-runtime-api-boundaries.md) — Runtime API and simulation engine boundaries _(Runtime API ↔ sim_engine boundaries)_
- [ADR-0016](./ADR-0016-dev-iochiplet-noc-and-memory-path.md) — IOChiplet NoC and memory data path _(IOChiplet NOC and memory data path)_
- [ADR-0017](./ADR-0017-dev-cube-noc-and-hbm-connectivity.md) — Cube NoC and HBM connectivity _(Cube NOC and HBM connectivity)_
## Detailed Architecture
One subsection per component file under `src/kernbench/components/builtin/`.
### forwarding
- [ADR-0037](./ADR-0037-dev-forwarding-component.md) — Forwarding component (forwarding_v1)
### hbm_ctrl
- [ADR-0034](./ADR-0034-dev-hbm-controller-internal-design.md) — HBM controller internal design
### io_cpu
- [ADR-0036](./ADR-0036-dev-io-cpu-component-model.md) — IO_CPU component model
### m_cpu
- [ADR-0035](./ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md) — M_CPU and M_CPU.DMA component model
### pcie_ep
- [ADR-0038](./ADR-0038-dev-pcie-ep-component-model.md) — PCIE_EP Component Model
### pe_cpu
- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model
### pe_dma
- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model
- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
### pe_fetch_store
- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model
### pe_gemm
- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model
### pe_ipcq
- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
### pe_math
- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model
### pe_mmu
- [ADR-0039](./ADR-0039-dev-pe-mmu-component-model.md) — PE_MMU Component Model — dual role as component + utility
### pe_scheduler
- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model
### pe_tcm
- [ADR-0040](./ADR-0040-dev-pe-tcm-component-model.md) — PE_TCM Component Model — dual-channel BW serialization
### sram
- [ADR-0041](./ADR-0041-dev-cube-sram-component-model.md) — Cube SRAM Component Model — terminal scratchpad on cube NoC
### tiling
- [ADR-0042](./ADR-0042-prog-tile-plan-generators.md) — Tile Plan Generators — GEMM/Math pipeline plan builders
## Implementation Decisions
### Address Scheme
- [ADR-0001](./ADR-0001-mem-physaddr-layout.md) — 51-bit physical address layout and decoding contract
- [ADR-0011](./ADR-0011-mem-memory-addressing-simplification.md) — Memory addressing — PA / VA / LA address model
### Routing & Helper API
- [ADR-0002](./ADR-0002-lat-routing-distance.md) — Routing distance, ordering, and detour rules
- [ADR-0051](./ADR-0051-lat-routing-helper-api.md) — Routing Helper API — `AddressResolver` + `PathRouter`
### Memory Semantics & Local-HBM Bandwidth
- [ADR-0004](./ADR-0004-mem-memory-semantics-local-hbm.md) — Memory semantics and local HBM bandwidth guarantees
### Topology Compilation, Diagrams & Builder Algorithms
- [ADR-0005](./ADR-0005-dev-diagram-views-distance-layout.md) — Diagram views and distance-based layout rules
- [ADR-0006](./ADR-0006-dev-topology-compilation-distance-diagram.md) — Topology compilation, distance extraction, and automatic diagram generation
- [ADR-0053](./ADR-0053-dev-topology-builder-algorithms.md) — Topology Builder + Visualizer Algorithms
### Tensor Deployment and Allocation
- [ADR-0008](./ADR-0008-api-tensor-deploy-and-allocation.md) — Tensor deployment and allocation (host allocator, PA-first)
### Kernel Execution and Host-Device Messaging
- [ADR-0009](./ADR-0009-api-kernel-execution-messaging.md) — Kernel execution messaging and completion semantics
- [ADR-0012](./ADR-0012-api-host-io-message-schema.md) — Host ↔ IO_CPU message schema (PA-first, PE-tagged)
### CLI Surface and Semantics
- [ADR-0010](./ADR-0010-api-cli-surface-and-semantics.md) — Command-line interface and execution semantics
### Component Port/Wire Fabric Model
- [ADR-0015](./ADR-0015-dev-component-port-wire-model.md) — Component port/wire model and fabric routing
### Two-Pass Data Execution
- [ADR-0020](./ADR-0020-prog-data-execution-two-pass.md) — 2-pass data execution model (timing / data separation)
### 2D Grid Program Identity
- [ADR-0022](./ADR-0022-prog-program-id-2d-grid.md) — 2D grid program_id semantics
### Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)
- [ADR-0024](./ADR-0024-par-sip-tp-launcher.md) — SIP-level Launcher — rank = SIP
- [ADR-0026](./ADR-0026-par-dppolicy-intra-device.md) — DPPolicy = Intra-Device Only — removal of the sip/num_sips fields
- [ADR-0027](./ADR-0027-par-megatron-tp.md) — Megatron-style Tensor Parallelism API
- [ADR-0047](./ADR-0047-par-ahbm-ccl-backend.md) — AHBM CCL Backend — `torch.distributed`-compat shim
- [ADR-0050](./ADR-0050-par-ccl-algorithm-module-contract.md) — CCL Algorithm Module Contract — `ccl/algorithms/*.py`
### IPCQ Direction Addressing
- [ADR-0025](./ADR-0025-algo-ipcq-direction-addressing.md) — IPCQ Direction Addressing — address-based matching
### Intercube All-Reduce
- [ADR-0032](./ADR-0032-algo-intercube-allreduce.md) — Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
### Evaluation Harnesses
- [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce evaluation harness — `tests/sccl/`
- [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM evaluation harness — `scripts/gemm_sweep.py` + `tests/gemm/`
- [ADR-0054](./ADR-0054-eval-milestone-benches.md) — Milestone evaluation bench — self-contained sweep + figure bench
### Bench Module Contract
- [ADR-0045](./ADR-0045-prog-bench-module-contract.md) — Bench Module Contract — registration, dispatch, and authoring
### Kernel-side tl.* API (TLContext)
- [ADR-0046](./ADR-0046-prog-tl-context-contract.md) — TLContext — Kernel-side `tl.*` API Contract
### Memory Allocator Algorithms
- [ADR-0048](./ADR-0048-mem-allocator-algorithms.md) — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
### Probe Subcommand
- [ADR-0049](./ADR-0049-ver-probe-subcommand.md) — `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
### Sim-engine Op Log and Memory Store Schemas
- [ADR-0052](./ADR-0052-dev-oplog-memory-store-schemas.md) — OpLog + MemoryStore Schemas — sim_engine internals
@@ -7,6 +7,11 @@ Accepted
Documents the `tests/sccl/` evaluation harness; verified against the
implementation (constants, file set, and sweep dimensions cross-checked).
**Amended by ADR-0054**: the driver core, sweeps, and renderers moved into
the `milestone-1h-ccl` bench (single home); `tests/sccl/_allreduce_helpers.py`
now re-exports from it (keeping the pytest-only param builders +
`_run_distributed` wrapper local). The figure tests are unchanged.
## Context
ADR-0032 defines the intercube all-reduce *algorithm*; ADR-0023/0024/0027
@@ -9,6 +9,12 @@ implementation (constants, tile sizes, figure set, and the script↔test
split cross-checked). The D5/D6 caveats are recorded limitations, not
inaccuracies.
**Amended by ADR-0054**: the sweep + renderers moved into the
`milestone-1h-gemm` bench (single home); `scripts/gemm_sweep.py` and
`tests/gemm/` now re-export from it. D1/D2's "data generation stays a manual
script / heavy work is opt-in" is superseded by the eval-bench pattern (one
bench regenerates everything; `MILESTONE_FAST=1` reuses the committed JSON).
## Context
ADR-0014 (PE pipeline) and ADR-0042 (tile-plan generators) define the GEMM
@@ -0,0 +1,295 @@
# ADR-0045: Bench Module Contract — registration, dispatch, and authoring
## Status
Accepted (2026-05-21).
Unifies the `src/kernbench/benches/` registration mechanism (@bench), the
CLI dispatch path (`kernbench run/list`), and the contract a new bench
module must follow. ADR-0010 (CLI surface) specifies the `kernbench
list/run` interface, but **how benches are registered and what signature
they must follow** had no ADR-level coverage.
**Extended by ADR-0054**: D5's single-config rule gains a third pattern —
the *eval bench* (e.g. `milestone-1h-*`) drives many configs, builds its
own per-config engines, and submits a sentinel tensor to satisfy D4.
## First action
When `kernbench.benches` is imported, `__init__.py` immediately calls
`_eager_import_and_audit(__path__, __name__)`. Its first action is to
enumerate every sibling module in the package directory via
`pkgutil.iter_modules(__path__)` and **eagerly import** each one via
`importlib.import_module(...)` — except modules matching either:
- name `registry` (the infrastructure module itself), or
- name starting with `_` (helper modules).
At import time, each `@bench(name=..., description=...)` decorator inside
the imported module runs, appending `(name, description, fn)` to
`_PENDING` and adding `fn.__module__` to `_REGISTERED_MODULES`.
Once imports finish, `_audit_modules(imported, _REGISTERED_MODULES)`
runs; if any imported module did not invoke `@bench` at least once, it
raises `RuntimeError("Bench module(s) missing @bench decorator: ...")`
immediately. At this point indices are still unassigned — the first call
to `list_all()` / `resolve(...)` triggers `_finalize()`, which sorts
`_PENDING` alphabetically by name and assigns 1-based indices.
In short, **the bench infrastructure's first act is "eagerly import
every non-helper module in the package and audit that each one
registered at least one bench"**.
## Context
`src/kernbench/benches/` currently holds 8 bench modules (`ccl_allreduce`,
`gemm_single_pe`, `gpt3_qkv`, `ipcq_allreduce`, `matmul_composite`,
`qkv_gemm`, `qkv_gemm_multi_pe`, `va_offset_verify`). Every bench follows
the same unified flow:
```
kernbench run --topology <T> --bench <N>
cli/main.py::cmd_run
↓ resolve_topology(T) + resolve(N) + resolve_device(device_arg)
runtime_api/bench_runner.py::run_bench(topology, bench_fn, device, engine_factory)
↓ engine_factory(topology, device) → GraphEngine
↓ RuntimeContext(engine, target_device, correlation_id, spec)
bench_fn(ctx) ← invokes the bench's run(torch)
↓ ctx.empty/zeros/from_numpy/launch/distributed.* etc. submit work
ctx.wait_all() ← drains any outstanding handles
BenchResult(completion, correlation_id, trace, traces, engine)
```
ADR-0010 covers only the CLI surface (`run/list/probe/web`); ADR-0007
covers only the runtime API ↔ sim_engine boundary. The question "what
shape must a new bench file take?" had to be answered by grepping the
codebase. As a result:
- The @bench decorator contract (kebab-case name, non-empty description)
lived only in the source.
- The bench function signature (`def run(torch)`) was a de-facto
convention enforced by the CLI dispatcher calling `spec.run`.
- New bench authors learned the "helpers must use `_` prefix" rule only
after seeing the audit's RuntimeError.
- The single-device convention (CLAUDE.md Part 2 CLI Semantics) and its
interaction with multi-SIP CCL benches was ambiguous for bench
authors.
This ADR consolidates all of it in one place.
## Decision
### D1. @bench decorator contract
```python
from kernbench.benches.registry import bench


@bench(name="my-bench", description="Short, complete-sentence description.")
def run(torch):
    ...
```
- `name`: kebab-case string matching `^[a-z][a-z0-9]*(-[a-z0-9]+)*$`.
Lowercase letters, digits, and dashes only; underscores forbidden;
must start with a letter.
- `description`: non-empty string (stripped length > 0). Displayed
verbatim by `kernbench list`.
- The decorator **returns the function unchanged** — direct invocation
is fine. Its only side effect is appending to `_PENDING`.
Violations of the first two rules raise `ValueError` at decoration time.
Duplicate names are caught at `_finalize()` with
`RuntimeError("duplicate bench name: ...")`.
### D2. Module-file convention
Every `src/kernbench/benches/<slug>.py` must be one of:
- **A bench module**: at top-level import, `@bench(...)` runs at least
once to register at least one bench.
- **A helper module**: the filename starts with `_` (e.g.,
`_shared_helpers.py`). The eager importer skips it.
The audit (`_audit_modules`) rejects any non-helper that fails to call
`@bench`. Intended consequence: dropping a new file into `benches/`
automatically registers its benches, and helper modules are clearly
flagged by their filename prefix alone.
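The module-file rule and the audit can be sketched as two small functions. The function names here are illustrative (the source names the real ones `_eager_import_and_audit` and `_audit_modules`), and the sketch works on plain module-name strings rather than performing real imports.

```python
def is_bench_candidate(module_name: str) -> bool:
    """Modules the eager importer would import: not `registry`, not `_`-prefixed."""
    return module_name != "registry" and not module_name.startswith("_")


def audit_modules(imported, registered):
    """Raise if any imported bench module never invoked @bench."""
    missing = sorted(m for m in imported if m not in registered)
    if missing:
        raise RuntimeError(
            "Bench module(s) missing @bench decorator: " + ", ".join(missing)
        )


siblings = ["registry", "_shared_helpers", "gemm_single_pe", "qkv_gemm"]
imported = [m for m in siblings if is_bench_candidate(m)]
print(imported)

# Passes: both imported modules registered a bench.
audit_modules(imported, registered={"gemm_single_pe", "qkv_gemm"})
```

A file dropped into `benches/` without a `_` prefix and without an `@bench` call would land in `missing` and abort the package import.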
### D3. The bench function signature is `def run(torch)`
The decorator does not enforce a function name, but **CLI dispatch calls
`spec_entry.run`** (the decorated callable). The convention is therefore:
- Function name: `run`. Other names work, but always use `run` for
readability and grep-ability.
- Argument: a single positional `torch`. In practice this is a
`RuntimeContext` instance exposing PyTorch-style namespaces
(zeros/empty/launch/distributed/...) — see ADR-0024 D3.
- Return value: any (`Any`). `run_bench` ignores it and tracks
completion via `ctx.handles()` / `engine.get_completion()`.
The `torch` name imitates a PyTorch-compatible idiom; the actual PyTorch
module is not passed in (aligned with ADR-0024's "rank = SIP" launcher
convention).
### D4. A bench must submit at least once
If `ctx.handles()` is empty after the bench returns, `run_bench` reports
`BenchResult.completion = ok=False, error_code="NO_REQUESTS"`. So a
meaningful bench must invoke at least one of:
- Tensor-creation APIs: `torch.zeros(...)`, `torch.empty(...)` — these
internally submit `MmuMapMsg` and (for zeros) `MemoryWriteMsg`.
- Kernel-launch API: `torch.launch(name, fn, *args)` — submits per-SIP
`KernelLaunchMsg`.
- (Exception) Empty placeholder benches: e.g.,
`ipcq_allreduce.py`'s `print(...)`-only stub will receive a
NO_REQUESTS result. CI is expected to recognize and handle placeholder
benches specially.
### D5. Single-device convention + multi-SIP exception (ADR-0024/0027)
CLAUDE.md Part 2 CLI Semantics' **"benchmarks MUST remain
single-device"** rule is interpreted as follows:
- **Standard bench (single-SIP use)**: define tensor placement with
`dp = DPPolicy(...)` and launch with `torch.launch(...)`. The SIP
index is chosen by `--device` (CLI's responsibility).
- **CCL bench (multi-SIP use)**: as an exception, use
`torch.distributed.init_process_group(backend="ahbm")` plus
`torch.multiprocessing.spawn(_worker, ..., nprocs=ws)` for the
rank = SIP pattern (ADR-0024 D3). `--device` is ignored (or treated
as `all`); each spawned worker calls `torch.ahbm.set_device(rank)` to
bind to its SIP.
Multi-device patterns outside these two (e.g., one bench function
launching across multiple SIPs in the same process) are forbidden by
this ADR. Even with `--device all`, the bench runs once; multi-SIP use
inside that single run must follow D5's second pattern.
### D6. Name/index resolution (`resolve`)
`resolve(identifier: str)` returns a BenchSpec via:
1. If `identifier.isdigit()`: convert to int and find the spec where
`index ==` that value. If none, `ValueError("No bench with index
...")`.
2. If `identifier in _REGISTRY`: direct lookup.
3. Otherwise: `ValueError("Unknown bench ...")`.
Empty or whitespace-only identifiers raise `ValueError("bench
identifier must be a non-empty string.")`.
The CLI passes `--bench` directly to `resolve`, so users can use either
`kernbench run --bench gemm-single-pe` or `kernbench run --bench 2`.
### D7. Indices are not a stable API
`_finalize()` sorts `_PENDING` alphabetically by name and assigns
1-based indices. Adding a new bench can shift existing benches'
indices. Therefore:
- Human-interactive use: indices are fine.
- Scripts / CI automation: always use the name.
This caveat is documented in `registry.py`'s module docstring.
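D6 and D7 together can be illustrated with a self-contained sketch. The `BenchSpec` fields and the function signatures below are assumptions (the real `resolve` takes only the identifier and reads a module-level `_REGISTRY`); the error messages follow the ones quoted above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BenchSpec:
    index: int
    name: str
    description: str
    run: Callable


def finalize(pending):
    """Sort pending (name, description, fn) entries by name; assign 1-based indices."""
    registry = {}
    ordered = sorted(pending, key=lambda entry: entry[0])
    for index, (name, description, fn) in enumerate(ordered, start=1):
        if name in registry:
            raise RuntimeError(f"duplicate bench name: {name}")
        registry[name] = BenchSpec(index, name, description, fn)
    return registry


def resolve(registry, identifier: str) -> BenchSpec:
    """Look a bench up by 1-based index (digits) or by name."""
    if not identifier or not identifier.strip():
        raise ValueError("bench identifier must be a non-empty string.")
    if identifier.isdigit():
        wanted = int(identifier)
        for spec in registry.values():
            if spec.index == wanted:
                return spec
        raise ValueError(f"No bench with index {identifier}")
    if identifier in registry:
        return registry[identifier]
    raise ValueError(f"Unknown bench {identifier}")


def noop(torch):
    return None


before = finalize([("qkv-gemm", "d", noop), ("gemm-single-pe", "d", noop)])
after = finalize([("qkv-gemm", "d", noop), ("gemm-single-pe", "d", noop),
                  ("ccl-allreduce", "d", noop)])
print(resolve(before, "1").name, resolve(after, "1").name)
```

Adding `ccl-allreduce` pushes `gemm-single-pe` from index 1 to index 2, which is exactly why scripts should resolve by name.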
### D8. Surface RuntimeContext exposes to benches
A bench's `torch` parameter may legitimately use:
- **Tensor creation**: `torch.empty(shape, dtype=..., dp=DPPolicy(...),
name=...)`, `torch.zeros(...)`, `torch.from_numpy(arr)`. All submit
host-side metadata plus device deployment (`MmuMapMsg` +
`MemoryWriteMsg`).
- **Kernel launch**: `torch.launch(kernel_name, kernel_fn, *args)` —
converts `(Tensor, int, float)` positional args to `TensorArg` /
`ScalarArg`, submits per-SIP `KernelLaunchMsg`, and drains.
- **Synchronization**: `torch.wait(handle)`, `torch.wait_all()`
(`run_bench` calls the latter automatically).
- **Distributed**: `torch.distributed.init_process_group(backend="ahbm")`,
`torch.distributed.get_world_size()`,
`torch.distributed.all_reduce(t, op=...)` (ADR-0024/0027).
- **Multi-process (rank = SIP)**:
`torch.multiprocessing.spawn(_worker, ..., nprocs=ws)` (ADR-0024 D3 /
ADR-0027).
- **Device binding**: `torch.ahbm.set_device(rank)` or
`torch.accelerator.set_device_index(rank)` (both point to the same
namespace).
- **IPCQ install**: `torch.install_ipcq(algorithm=..., ccl_yaml=...)`
(ADR-0023 D10).
- **Spec lookup**: `torch.spec` — the dict produced by the topology
builder (system / cube_mesh / HBM parameters etc.). Use it so the
bench does not hardcode topology.yaml values.
Benches must not access RuntimeContext private members (`_handles`,
`_traces`, `_allocators`, etc.) directly. This aligns with ADR-0007's
layer-boundary spirit: bench → runtime API → sim_engine flows in one
direction.
### D9. Environment-variable parameterization is allowed
Benches may parameterize themselves via `os.environ.get(...)`, as
`matmul_composite.py` does for `MATMUL_M`, `MATMUL_K`, `MATMUL_N`,
`MATMUL_DTYPE`, `MATMUL_VARIANT`. Rationale:
- The bench function signature is fixed by D3 to `def run(torch)`, so
positional/keyword arguments cannot carry parameters.
- The env-var pattern is a natural hook for operational sweeps (e.g.,
`MATMUL_VARIANT`).
- External drivers such as `scripts/gemm_sweep.py` (ADR-0044) consume
this hook (it sets `MATMUL_M/K/N/VARIANT` at
`scripts/gemm_sweep.py:115-118`).
When environment variables alter bench behavior, the module docstring
must list every variable used (`matmul_composite.py` is the canonical
example).
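A minimal sketch of the env-var hook follows. The variable names come from this section; the default values, the returned dict shape, and the helper name `read_matmul_params` are assumptions for illustration and are not taken from `matmul_composite.py`.

```python
import os


def read_matmul_params(env=None) -> dict:
    """Read MATMUL_* variables, falling back to illustrative defaults.

    Environment variables used (defaults here are placeholders):
      MATMUL_M, MATMUL_K, MATMUL_N  integer GEMM dimensions
      MATMUL_DTYPE                  element type name
      MATMUL_VARIANT                kernel variant selector
    """
    env = os.environ if env is None else env
    return {
        "m": int(env.get("MATMUL_M", "128")),
        "k": int(env.get("MATMUL_K", "128")),
        "n": int(env.get("MATMUL_N", "128")),
        "dtype": env.get("MATMUL_DTYPE", "fp16"),
        "variant": env.get("MATMUL_VARIANT", "default"),
    }


print(read_matmul_params({"MATMUL_M": "256", "MATMUL_N": "64"}))
```

Listing every variable in the docstring, as the sketch does, mirrors the documentation requirement stated above.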
## Alternatives Considered
### A1. An explicit manifest file (YAML) listing benches
Rejected. The `@bench` + audit pattern guarantees "drop in file → auto-
register", concentrating cognitive cost in one place (the file itself).
A separate manifest is prone to drift, and helper separation is already
clear via the `_` prefix.
### A2. Allowing the bench's entry-point name in the decorator
(`@bench(name=..., entry="run_xxx")`)
Rejected. Breaks the simplicity of dispatch (`spec.run` is a single
callable). The `run` convention is sufficient; variants can register
multiple `@bench`-decorated functions in the same module.
### A3. A separate `@multi_device_bench` decorator for CCL
Rejected. The two patterns named in D5 (single + ADR-0024 multi-SIP)
cover all 8 current benches. A separate decorator would force dispatch
to branch and add complexity; the multi-SIP intent is already obvious
from the bench's `init_process_group(...)` call.
### A4. Make indices a stable API (registration order or explicit
`index=` argument)
Rejected. D7's trade-off favors user-friendliness — alphabetically
sorted 1-based indices read naturally in the `list` output. Scripts can
use names.
## Consequences
- "How to add a bench" is consolidated in one ADR — new authors only
need to read D1-D3 and D8 without grepping source.
- The `_`-prefixed helper-module pattern is legitimized at ADR level,
so future `benches/_*.py` shared helpers are free to be added.
- The CLI's single-device convention and CCL's multi-SIP exception are
shown to be consistent (D5) — they are orthogonal.
- The rationale for ADR-0044's GEMM eval harness using env-var hooks
(D9) is now ADR-pinned.
- Indices are explicitly unstable (D7), so any CI code calling
`kernbench run --bench 3` is flagged for review after this ADR is
accepted.
@@ -0,0 +1,327 @@
# ADR-0046: TLContext — Kernel-side `tl.*` API Contract
## Status
Accepted (2026-05-22).
Documents the set of `tl.*` primitives exposed by
`src/kernbench/triton_emu/`'s `TLContext`, their semantics, and the two
execution-mode contracts (command-list / greenlet runner). ADR-0014 and
ADR-0020 define the PE pipeline and the 2-pass execution model, but **the
`tl.*` surface that bench kernel functions call** had no ADR-level coverage.
## First action
When `TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id,
num_cubes, scratch_base, scratch_size)` is instantiated, the first action
is to initialize six categories of state:
- `self._pe_id`, `self._num_programs`, `self._cube_id`, `self._num_cubes`:
values that `tl.program_id` / `tl.num_programs` will return.
- `self._dispatch_cycles` — cycle count emitted as `PeCpuOverheadCmd(cycles)`
at the start of every `tl.*` API call.
- `self._runner`: `KernelRunner` instance (present → greenlet mode;
absent → command-list mode).
- `self._commands: list[PeCommand] = []` — command-list accumulator
(command-list mode only).
- `self._handle_counter = 0`, `self._completion_counter = 0` — counters
for generating TensorHandle / CompletionHandle ids.
- `self._scratch_base`, `self._scratch_size`, `self._scratch_cursor = 0`:
PE-local scratch region (used for math/dot/composite output handle
addresses).
In short, **TLContext's first act is "record where (sip/cube/pe) and at
what scale (num_programs/num_cubes) this kernel instance runs, and pick
its dispatch mode (runner present or not)"**. No SimPy event is created
and no command is emitted at this moment.
The first runtime action happens when the kernel function first calls a
`tl.<api>()`. The standard entry for every command-emitting `tl.*` API is:
1. Call `self._emit_dispatch_overhead()` — if `dispatch_cycles > 0`,
immediately `_emit` a `PeCpuOverheadCmd(dispatch_cycles)`.
2. Per-API processing (TensorHandle creation, command construction).
3. `self._emit(cmd)` — in runner mode this `greenlet.switch()`es the cmd
to SimPy; in command-list mode it appends to `self._commands`.
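The three-step entry above can be sketched with a toy context. This is a minimal illustration under stated assumptions, not the real `TLContext`: `MiniTL`, `EchoRunner`, and the two-field `DmaReadCmd` are hypothetical stand-ins, and only the names `_emit`, `_emit_dispatch_overhead`, `switch_to_simpy`, and `PeCpuOverheadCmd` are taken from this ADR.

```python
from dataclasses import dataclass


@dataclass
class PeCpuOverheadCmd:
    cycles: int


@dataclass
class DmaReadCmd:
    # Hypothetical reduced form; the real command carries more fields.
    ptr: int
    nbytes: int


class EchoRunner:
    """Hypothetical runner stub: records the command, returns a fake result."""

    def __init__(self):
        self.seen = []

    def switch_to_simpy(self, cmd):
        self.seen.append(cmd)
        return "done"


class MiniTL:
    """Toy stand-in for TLContext; only the emit path is modelled."""

    def __init__(self, dispatch_cycles=1, runner=None):
        self._dispatch_cycles = dispatch_cycles
        self._runner = runner
        self._commands = []

    def _emit(self, cmd):
        if self._runner is not None:
            # Runner mode: hand the command over and return its result.
            return self._runner.switch_to_simpy(cmd)
        # Command-list mode: only record the command.
        self._commands.append(cmd)
        return None

    def _emit_dispatch_overhead(self):
        if self._dispatch_cycles > 0:
            self._emit(PeCpuOverheadCmd(self._dispatch_cycles))

    def load(self, ptr, nbytes):
        self._emit_dispatch_overhead()   # step 1: dispatch overhead
        cmd = DmaReadCmd(ptr, nbytes)    # step 2: per-API processing
        return self._emit(cmd)           # step 3: mode-blind emit


list_tl = MiniTL(dispatch_cycles=1)
list_tl.load(0x1000, 64)
recorded = [type(c).__name__ for c in list_tl._commands]

runner = EchoRunner()
runner_tl = MiniTL(dispatch_cycles=0, runner=runner)
result = runner_tl.load(0x1000, 64)
```

In the sketch the API method never checks which mode it is in; the branch lives only in `_emit`, which is the property D2 relies on.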
## Context
The `tl.*` surface consists of `TLContext`'s methods, and the `tl`
parameter received by a kernel function is one of these objects. The
contract the user (bench author) sees:
- Which primitives exist.
- What data flow each primitive triggers (DMA / compute / IPCQ /
metadata-only).
- How a TensorHandle's `space` and `addr` are decided.
- The difference between command-list and greenlet modes.
ADR-0014 (PE pipeline) defines the PeCommands consumed by PE_SCHEDULER,
but how `tl.*` emits them is a code-only convention. ADR-0020 (2-pass
data execution) mentions greenlet mode in D3 but does not pin down the
signature difference (return-value handling) between the runner /
non-runner paths. This ADR fills the gap.
## Decision
### D1. The `tl` parameter is a `TLContext` instance
A bench kernel function has the signature:
```python
def _kernel(arg1, arg2, ..., tl, **kwargs):
...
```
`tl` is a `kernbench.triton_emu.tl_context.TLContext` instance. The name
imitates real Triton's `triton.language` module; the actual Triton
module is **not** passed in.
The kernel is plain Python — no `yield` or `async`. `tl.*` calls produce
SimPy events, but to the caller they appear synchronous because in
greenlet mode the KernelRunner relays between SimPy and the kernel
(ADR-0020 D3).
### D2. Two execution modes — command-list / greenlet runner
- **Command-list mode (`runner is None`)**: `tl.*` calls append PeCommand
to `self._commands`. DMA / GEMM / Math consume no SimPy time and return
metadata-only TensorHandles (`data=None`). PE_SCHEDULER / sim_engine
later replays the command sequence in time.
- **Greenlet runner mode (`runner is not None`)**: `tl.*` calls
`self._emit(cmd)` → `runner.switch_to_simpy(cmd)`, handing control to
the parent greenlet (SimPy). The parent distributes the cmd to
components, consumes SimPy time, and (for DMA reads) returns real numpy
data. The kernel receives the result and continues to the next line
(the data-aware execution model from ADR-0020 D3).
The choice of mode is decided by whether a KernelRunner is injected into
the TLContext. The `tl.*` methods themselves are mode-blind — they go
through `_emit()` uniformly.
### D3. Primitive categories
#### D3.1. Reference (no DMA, metadata only)
- `tl.ref(ptr, shape, dtype="f16") -> TensorHandle`: create a handle
referencing HBM data without issuing DMA. Used when the scheduler
streams the data per-tile (e.g., the b operand of a composite GEMM).
#### D3.2. Data movement (blocking, DMA engine)
- `tl.load(ptr, shape, dtype="f16") -> TensorHandle`: HBM → handle.
Emits `DmaReadCmd`. In greenlet mode the returned handle's `.data`
carries real numpy data; in command-list mode it is a placeholder.
The handle has `space="hbm"`, `pinned=True`.
- `tl.store(ptr, handle) -> None`: TCM → HBM. Emits `DmaWriteCmd`. In
greenlet mode, when `handle.data` is present, `_store.write("hbm",
ptr, data)` runs first (visibility = issue time, ADR-0020 D3).
#### D3.3. GEMM / compute (blocking)
- `tl.dot(a, b) -> TensorHandle`: `a @ b`. Both operands must live in
TCM; shapes `(M,K) × (K,N) → (M,N)`. Emits `GemmCmd`; the output
handle is allocated from PE-local scratch via
`_make_compute_out(shape, dtype)`.
- `tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None,
acc_dtype=None, tile_shape=None) -> CompletionHandle`: non-blocking
tiled pipeline. Emits `CompositeCmd`. `epilogue` is a list of dicts,
each with `"op"` plus op-specific fields and an optional `"scope"`
(k_tile / output_tile). Unknown ops or missing fields raise
ValueError immediately. The returned CompletionHandle synchronizes
via `tl.wait(h)`.
#### D3.4. Math: unary (blocking)
- `tl.exp(x)`, `tl.log(x)`, `tl.sqrt(x)`, `tl.abs(x)`, `tl.sigmoid(x)`,
`tl.cos(x)`, `tl.sin(x)` — each emits `MathCmd(op=<name>,
inputs=(x,), out=<out>)`. `out` is scratch-allocated with the same
shape/dtype as `x`.
#### D3.5. Math: binary (blocking)
- `tl.maximum(a, b)`, `tl.minimum(a, b)` — `_binary_math`.
- `tl.fma(a, b, c)` — `a*b + c`. Three inputs.
- `tl.clamp(x, min, max)` — `MathCmd(op="clamp", inputs=(x, min, max))`.
- `tl.where(cond, a, b)` — `MathCmd(op="where", inputs=(cond, a, b))`.
- `tl.softmax(x, axis=-1)` — a single `MathCmd(op="softmax")` so timing
is accounted as a single dispatch. Phase 2 DataExecutor expands it to the
canonical (x-max → exp → sum → div) sequence.
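The canonical expansion named for `tl.softmax` can be written out in plain Python. This is a sketch of the standard numerically stable softmax over one row, not the DataExecutor's code; `softmax_expanded` is a hypothetical name, and the real implementation presumably works on numpy arrays along `axis`.

```python
import math


def softmax_expanded(row):
    """Softmax over one row as the four-step sequence: x-max, exp, sum, div."""
    m = max(row)
    shifted = [v - m for v in row]          # x - max (keeps exp from overflowing)
    exps = [math.exp(v) for v in shifted]   # exp
    total = sum(exps)                       # sum
    return [e / total for e in exps]        # div


probs = softmax_expanded([1.0, 2.0, 3.0])
stable = softmax_expanded([1000.0, 1000.0])
```

Subtracting the row maximum first is what lets the second call succeed; a naive `math.exp(1000.0)` would raise `OverflowError`.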
#### D3.6. Reduction (blocking)
- `tl.sum(x, axis)`, `tl.max(x, axis)`, `tl.min(x, axis)` — return an
output handle with the axis size collapsed to 1. Emit
`MathCmd(op=<name>, inputs=(x,), out=<out>, axis=axis)`.
#### D3.7. Index / scalar (PE_CPU, no engine)
- `tl.program_id(axis=0) -> int`: `axis==0` → pe_id (cube-local PE
index), `axis==1` → cube_id (ADR-0022).
- `tl.num_programs(axis=0) -> int`: `axis==0` → num_programs (PEs per
cube), `axis==1` → num_cubes.
- `tl.arange(start, end, dtype="i32") -> TensorHandle`: an index range
in TCM. No command emitted.
- `tl.zeros(shape, dtype="f16") -> TensorHandle`, `tl.full(shape,
value, dtype="f16") -> TensorHandle`: TCM placeholder. No command
emitted.
#### D3.8. Scalar helpers (no command, no engine)
- `TLContext.cdiv(a, b) -> int` (static): ceiling division
`-(-a // b)`. Mirrors real Triton's `tl.cdiv`.
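A short worked example of the ceiling-division idiom, written as a free function for illustration (the ADR exposes it as a static method on `TLContext`; the tile sizes below are made-up values):

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division using only integer floor division."""
    return -(-a // b)


# 1000 elements in tiles of 128 need 8 tiles (7 full tiles plus a partial one).
n_tiles = cdiv(1000, 128)
exact = cdiv(1024, 128)
empty = cdiv(0, 128)
```

The trick relies on Python's `//` rounding toward negative infinity, so negating before and after the division turns floor into ceiling without touching floating point.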
#### D3.9. Metadata-only (no compute, no DMA)
- `tl.trans(x) -> TensorHandle`: a new handle with the last two dims
swapped. Shares `addr` and `data`; no command emitted.
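A metadata-only transpose can be sketched as follows. The `Handle` dataclass is a hypothetical reduction of `TensorHandle` (the real class has more fields); the point is only that the new handle differs in `shape` while `addr` and `data` are carried over unchanged.

```python
from dataclasses import dataclass, replace
from typing import Any, Optional, Tuple


@dataclass(frozen=True)
class Handle:
    # Hypothetical subset of TensorHandle's fields.
    shape: Tuple[int, ...]
    dtype: str
    addr: int
    data: Optional[Any] = None


def trans(x: Handle) -> Handle:
    """Return a new handle with the last two dims swapped; nothing is copied."""
    if len(x.shape) < 2:
        raise ValueError("trans needs at least two dims")
    new_shape = x.shape[:-2] + (x.shape[-1], x.shape[-2])
    return replace(x, shape=new_shape)


payload = [0] * 64
h = Handle(shape=(2, 4, 8), dtype="f16", addr=0x4000, data=payload)
t = trans(h)
```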
#### D3.10. IPCQ (CCL) primitives (ADR-0023 D4)
- `tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None,
dtype="f16", space="tcm") -> None`: blocking send. Accepts either
handle form or raw-address form. Emits `IpcqSendCmd`. The handle's
`.data` snapshot rides along on the command — avoiding the race
where a later inbound IPCQ overwrites the slot before the outbound
PE_DMA reads it.
- `tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None,
dst_space=None) -> TensorHandle`: blocking recv. Providing both
`dst_addr` and `dst_space` enters "copy_to_dst" mode; otherwise
"return_slot" mode. In greenlet mode the handle's `.data` carries
the real data.
- `tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle`:
**DIAGNOSTIC ONLY**. Has the same blocking-arrival semantics as
`tl.recv` but skips the slot-read latency charge (slot-IO + PE↔bank
fabric drain). Used in the pe2pe overview plot for an apples-to-apples
comparison against `tl.store`. Production kernels MUST NOT use it —
the diagnostic flag is isolated in its own command branch
(`consume=False`) so it cannot be accidentally enabled.
- `tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture`: non-blocking
recv. Returns a `RecvFuture`; resolved later by `tl.wait(future)`.
#### D3.11. Composite + control
- `tl.composite(...)`: see D3.3.
- `tl.wait(handle=None)`: wait on a `CompletionHandle` (composite), a
`RecvFuture` (async recv), or `None` (all pending composites).
- `tl.cycles(n)`: declare a scalar PE_CPU overhead. Emits
`PeCpuOverheadCmd(cycles=n)`.
### D4. TensorHandle arithmetic operators — thread-local TLContext
At module load, `tl_context.py::_enable_tensor_ops()` runs and patches
`TensorHandle.__add__`, `__sub__`, `__mul__`, `__truediv__`. Each
operator calls `_binary_math` on the active TLContext stored in a
module-level thread-local `_ctx`.
So inside a kernel, `c = a + b` is equivalent to emitting
`MathCmd(op="add", inputs=(a, b), out=<out>)` and returning a new
TensorHandle.
Active-TLContext management:
- `TLContext._set_active(ctx)`: set the active ctx for the current
thread/greenlet.
- `TLContext._get_active()`: read it (RuntimeError if unset).
- `run_kernel(kernel_fn, tl_ctx, *args, **kwargs)`: helper. Sets active
on entry, runs the kernel, restores `None` on exit.
`KernelRunner` re-asserts `_set_active(tl)` inside its `_switch_kernel`
just before resuming the kernel, so a sibling PE runner that overwrote
the thread-local context is correctly recovered.
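The operator-patching pattern in D4 can be illustrated with a small self-contained model. Everything here is a sketch: `Handle` and `MiniTL` are hypothetical stand-ins, commands are recorded as plain tuples rather than `MathCmd` objects, and passing the context as the `tl=` keyword is an assumption about how `run_kernel` invokes the kernel.

```python
import threading

_ctx = threading.local()  # module-level holder for the active context


class Handle:
    def __init__(self, name):
        self.name = name


class MiniTL:
    def __init__(self):
        self.commands = []
        self._n = 0

    @staticmethod
    def _set_active(ctx):
        _ctx.active = ctx

    @staticmethod
    def _get_active():
        ctx = getattr(_ctx, "active", None)
        if ctx is None:
            raise RuntimeError("no active TLContext")
        return ctx

    def _binary_math(self, op, a, b):
        out = Handle(f"t{self._n}")
        self._n += 1
        self.commands.append((op, a.name, b.name, out.name))
        return out


def _enable_tensor_ops():
    """Patch arithmetic dunders so they route to the active context."""
    table = (("__add__", "add"), ("__sub__", "sub"),
             ("__mul__", "mul"), ("__truediv__", "div"))
    for dunder, op in table:
        def method(self, other, _op=op):  # bind op per iteration
            return MiniTL._get_active()._binary_math(_op, self, other)
        setattr(Handle, dunder, method)


_enable_tensor_ops()


def run_kernel(kernel_fn, tl_ctx, *args, **kwargs):
    MiniTL._set_active(tl_ctx)
    try:
        return kernel_fn(*args, tl=tl_ctx, **kwargs)
    finally:
        MiniTL._set_active(None)  # restore None on exit


def kernel(a, b, tl):
    return a + b * a


ctx = MiniTL()
result = run_kernel(kernel, ctx, Handle("a"), Handle("b"))
```

Because the operators look the context up at call time, using `a + b` outside `run_kernel` raises `RuntimeError` instead of silently emitting into a stale context.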
### D5. Scratch allocator — compute output handles
Ops that produce a result — `tl.dot`, `tl.exp`, `tl.add` (via
TensorHandle `__add__`), etc. — call `_make_compute_out(shape, dtype)`
to obtain a 16-byte-aligned scratch address. The address is published
with `space="tcm"`, so the handle can later be the source of a
`tl.send` / `tl.store`.
When `_scratch_base == 0` (e.g., command-list mode), the address is 0
and the handle cannot be a send/store source (in that case, only
`tl.load`-returned handles are valid sources).
When the cursor exceeds `_scratch_size` (default 1 MiB), a
RuntimeError is raised. The cursor must reset between kernel
invocations (current code naturally satisfies this: KernelRunner
creates a fresh TLContext each time).
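A bump allocator with the properties D5 describes might look like the sketch below. `ScratchAllocator` is a hypothetical name, and two details are assumptions rather than facts from the code: that alignment is applied to the cursor before each allocation, and that the base address is itself 16-byte aligned.

```python
class ScratchAllocator:
    """Toy bump allocator: 16-byte-aligned addresses, hard failure on overflow."""

    ALIGN = 16

    def __init__(self, base, size=1 << 20):  # 1 MiB default, as in D5
        self._base = base
        self._size = size
        self._cursor = 0

    def alloc(self, nbytes):
        # Round the cursor up to the next multiple of ALIGN.
        start = (self._cursor + self.ALIGN - 1) // self.ALIGN * self.ALIGN
        end = start + nbytes
        if end > self._size:
            raise RuntimeError(
                f"scratch overflow: need {nbytes} bytes at offset {start}, "
                f"size is {self._size}"
            )
        self._cursor = end
        return self._base + start


scratch = ScratchAllocator(base=0x4000, size=64)
first = scratch.alloc(10)    # offset 0
second = scratch.alloc(10)   # cursor 10 rounds up to offset 16
try:
    scratch.alloc(40)        # offset 32 + 40 bytes exceeds the 64-byte region
    overflowed = False
except RuntimeError:
    overflowed = True
```

There is no `free`: the cursor only moves forward, which is why D5 requires it to reset between kernel invocations.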
### D6. Dispatch overhead — `PeCpuOverheadCmd(dispatch_cycles)`
Every non-metadata `tl.*` call starts with `_emit_dispatch_overhead()`,
which — when `dispatch_cycles > 0` — emits
`PeCpuOverheadCmd(dispatch_cycles)`. This models the cycles PE_CPU
spends dispatching the command.
Defaults:
- `TLContext.__init__`'s `dispatch_cycles` parameter default: `1` cycle.
- TLContext built by `KernelRunner`: `0` cycles (greenlet mode handles
cycle accounting differently — aligned with ADR-0020 D3 intent).
### D7. Kernel registry (`triton_emu/registry.py`)
A separate `_kernels: dict[str, Callable]` holds the name → function
mapping:
- `register_kernel(name, fn)`: ValueError on duplicate.
- `get_kernel(name)`: KeyError if missing.
- `clear_registry()`: test-only.
`RuntimeContext.launch(kernel_name, kernel_fn, *args)` overwrites
`_kernels[kernel_name] = kernel_fn` on every call (last-call-wins,
idempotent) — consistent with ADR-0045 D8's `launch` behavior.
PE_CPU looks up `KernelRef.name` in the registry and runs the function
through KernelRunner.
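The two registration paths in D7 behave differently on duplicates, which a minimal model makes concrete. The function bodies are a sketch of the documented behavior only; `launch_register` is a hypothetical helper standing in for the overwrite that `RuntimeContext.launch` performs.

```python
from typing import Callable, Dict

_kernels: Dict[str, Callable] = {}


def register_kernel(name, fn):
    if name in _kernels:
        raise ValueError(f"kernel {name!r} already registered")
    _kernels[name] = fn


def get_kernel(name):
    return _kernels[name]  # raises KeyError if the name is missing


def clear_registry():
    _kernels.clear()  # test-only


def launch_register(name, fn):
    """Hypothetical stand-in for launch's last-call-wins overwrite."""
    _kernels[name] = fn


def k1():
    return 1


def k2():
    return 2


register_kernel("gemm", k1)
try:
    register_kernel("gemm", k2)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

launch_register("gemm", k2)      # overwrites without complaint
current = get_kernel("gemm")()
```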
## Alternatives Considered
### A1. Fold `tl.*` into ADR-0014 / ADR-0020
Rejected. ADR-0014 covers the PE pipeline (sim_engine-side consumption
of PeCommands); ADR-0020 covers 2-pass execution (Phase 1 timing /
Phase 2 data). The `tl.*` surface is what the kernel author touches; a
dedicated ADR improves findability and onboarding.
### A2. Deprecate command-list mode
Rejected (currently). Simple unit tests and kernel verification benefit
from the lighter command-list path — it exposes a PeCommand sequence
inspector without requiring greenlet machinery. When greenlet-mode
semantics (real data, Phase 2) are needed, D2 explicitly selects them.
### A3. Remove TensorHandle arithmetic operators
Rejected. They mimic real Triton kernel ergonomics (e.g., `c = a + b`),
and the thread-local active-ctx pattern works cleanly. The explicit
function-form (`tl.add(a, b)`) is also exposed in D3.5, so the
operators are syntactic sugar.
### A4. Expand softmax into the explicit sequence (max → exp → sum → div)
Partially adopted. `tl.softmax` is a single `MathCmd(op="softmax")` for
timing accounting (D3.5), but Phase 2 DataExecutor expands it to the
canonical sequence for real-data computation. The timing model is
atomic and the data model is expanded; the split is intentional.
## Consequences
- Every `tl.*` primitive a bench author meets is classified and defined
in a single ADR. Paired with ADR-0045 D8's host-side surface
(`torch.empty` etc.), the inside-kernel and outside-kernel authoring
guides are now complete.
- The command-list / greenlet difference is pinned in D2, so any new
`tl.*` primitive that follows the `_emit()` pattern auto-supports
both modes.
- The thread-local active-ctx pattern (D4) is justified at ADR level,
clarifying who owns the reset responsibility when multiple PE
runners share a thread (KernelRunner.run's contract restores active
inside `_switch_kernel`).
- `tl.recv_no_consume`'s diagnostic isolation (D3.10) is hardened in
ADR form — accidental production use is blocked by a separate
command branch.
- The registry (D7) gets its own D-section, formalizing the
name-collision and dynamic-re-registration semantics.
@@ -0,0 +1,259 @@
# ADR-0047: AHBM CCL Backend — `torch.distributed`-compat shim
## Status
Accepted (2026-05-22).
Pins down what `runtime_api/distributed.py`'s `AhbmCCLBackend` +
`DistributedContext` actually install — i.e., the entry point
`torch.distributed.init_process_group(backend="ahbm")` — and how
`all_reduce`/`barrier`/`get_rank` etc. are implemented. ADR-0023 D11
mentions the "torch.distributed compatibility" intent, but **the backend
itself** had no ADR-level coverage.
## First action
`RuntimeContext.__post_init__` automatically constructs a
`DistributedContext()` and attaches it to `self.distributed`. The first
action at that moment:
1. `self._backend: AhbmCCLBackend | None = None` — uninitialized.
2. `self._rank_by_greenlet: dict = {}` — greenlet-local rank registry
(ADR-0024 D2).
3. The caller (RuntimeContext) sets `dc._ctx_ref = self` so subsequent
`init_process_group` can reach `ctx.engine` / `ctx.spec` / `ctx.launch`.
In short, **DistributedContext's first act is "attach to RuntimeContext
with a back-reference and leave the backend slot empty"**. Actual
backend installation (IPCQ install, world_size derivation, algorithm
module import) happens only when user code calls
`torch.distributed.init_process_group(backend="ahbm")`.
At that moment, `init_process_group`'s first action is:
1. If `backend != "ahbm"`, raise `ValueError("Unsupported backend ...")`
immediately.
2. If `getattr(self, "_ctx_ref", None)` is None,
`RuntimeError("DistributedContext not bound to a RuntimeContext")`.
3. `self._backend = AhbmCCLBackend(torch_ctx=ctx)` — inside this
constructor, ccl.yaml is loaded, the algorithm module is imported,
world_size is derived, SFR is configured, and IPCQ is installed.
4. `self._backend._dist_ctx = self` — the backend gets a back-reference
so it can read `_rank_by_greenlet`.
## Context
The `AhbmCCLBackend` exists so that PyTorch DDP collective calls
(`init_process_group`, `all_reduce`, etc.) work unchanged and bench code
reads identically to a real DDP training script (in line with
ADR-0024 + ADR-0027's launcher model).
The backend's responsibilities:
- At `init_process_group` time, install the **IPCQ neighbor table once**
(analogous to NCCL communicator creation).
- For each `all_reduce(tensor, op="sum")`, dispatch the configured
algorithm's kernel function via `ctx.launch(...)`.
- Answer `get_world_size` / `get_rank` consistently from the
greenlet-local rank registry plus ccl.yaml/topology.
ADR-0023 D10 (IPCQ install plan) and ADR-0024 (SIP launcher) touch
parts of this, but **the backend's own responsibility scope and decision
order** are not pinned anywhere. This ADR fills that gap.
## Decision
### D1. The backend is created only at `init_process_group(backend="ahbm")` time
`DistributedContext` starts with `_backend = None`. The backend object
does not exist until the user calls
`dist.init_process_group(backend="ahbm")`. Any other API
(`is_initialized`, `get_world_size`, `all_reduce`, `barrier`) called
while `_backend` is None raises
`RuntimeError("Default process group has not been initialized...")` via
the `_ensure_initialized` helper.
`backend != "ahbm"` raises `ValueError` immediately. Other backend names
(`nccl`, `gloo`, etc.) are not recognized.
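The lazy-creation guards in D1 can be modelled in a few lines. `MiniDistributedContext` is a hypothetical reduction: the backend is a bare `object()` placeholder instead of `AhbmCCLBackend`, and only `get_backend` is shown as a guarded API.

```python
class MiniDistributedContext:
    def __init__(self):
        self._backend = None   # no backend until init_process_group
        self._ctx_ref = None   # set by the owning RuntimeContext

    def init_process_group(self, backend="ahbm", world_size=None, rank=None,
                           **kwargs):
        if backend != "ahbm":
            raise ValueError(f"Unsupported backend {backend!r}")
        if getattr(self, "_ctx_ref", None) is None:
            raise RuntimeError(
                "DistributedContext not bound to a RuntimeContext")
        self._backend = object()  # placeholder for the real backend

    def _ensure_initialized(self):
        if self._backend is None:
            raise RuntimeError(
                "Default process group has not been initialized")
        return self._backend

    def get_backend(self):
        self._ensure_initialized()
        return "ahbm"


def outcome(fn, *args, **kwargs):
    """Return the call's result, or the exception class name if it raises."""
    try:
        return fn(*args, **kwargs)
    except Exception as exc:
        return type(exc).__name__


dc = MiniDistributedContext()
outcomes = [
    outcome(dc.get_backend),                        # before init
    outcome(dc.init_process_group, backend="nccl"), # unknown backend
    outcome(dc.init_process_group, backend="ahbm"), # not bound yet
]
dc._ctx_ref = object()                              # bind, then retry
dc.init_process_group(backend="ahbm")
outcomes.append(outcome(dc.get_backend))
```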
### D2. world_size resolution priority — algorithm > defaults > topology
`AhbmCCLBackend._resolve_world_size` (ADR-0024 D1):
1. If `ccl.yaml`'s algorithm entry has `world_size`, use it.
2. Else if `defaults.world_size` is set, use it.
3. Else fall back to `spec.system.sips.count` (the topology's SIP count).
The default interpretation is **rank = SIP** (ADR-0024). Cube/PE-level
parallelism is expressed inside each rank via DPPolicy and does not
affect world_size. An explicit `ccl.yaml` override is preserved for the
legacy "rank = flat PE index" test path.
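User arguments to `init_process_group(world_size=..., rank=...)` are
**accepted but ignored**: both values come from the resolution above and
the greenlet rank registry (D4), loosely analogous to how real PyTorch's
`env://` initialization reads `RANK` / `WORLD_SIZE` from env vars when
they are not passed explicitly.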
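The priority order in D2 reduces to a three-way fallback. The sketch below assumes the algorithm entry and the `defaults` section are available as plain dicts and that the topology's SIP count is passed in as an integer; `resolve_world_size` as a free function is illustrative, not the method's real signature.

```python
def resolve_world_size(algo_cfg: dict, defaults: dict, sip_count: int) -> int:
    """algorithm entry > defaults section > topology SIP count."""
    if algo_cfg.get("world_size") is not None:
        return int(algo_cfg["world_size"])
    if defaults.get("world_size") is not None:
        return int(defaults["world_size"])
    return sip_count


from_algo = resolve_world_size({"world_size": 8}, {"world_size": 4}, 2)
from_defaults = resolve_world_size({}, {"world_size": 4}, 2)
from_topology = resolve_world_size({}, {}, 2)
```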
### D3. `init_process_group` performs five installation steps
Inside `AhbmCCLBackend.__init__`, in order:
1. **Load ccl.yaml**: `kernbench.ccl.install.load_ccl_config()` →
`resolve_algorithm_config(_cfg_all)` produces the merged config for
`defaults.algorithm` (or the user-specified algorithm).
2. **Import algorithm module**:
`importlib.import_module(self._merged["module"])`. The module must
expose a `kernel` function, a `kernel_args(world_size, n_elem,
cube_w, cube_h)` helper, and optionally a `TOPO_NAME_TO_KIND` map.
3. **Resolve world_size** (D2).
4. **Collect topology metadata** from `spec`: `n_sips`, `sip_topo`
(`ring_1d` default), `cube_w`/`cube_h`, `sips.w`/`sips.h`. When the
SIP topology is not `ring_1d`, derive `_sip_topo_w/h` from explicit
`w`/`h` or via square-root (require `w*h == n_sips`). Mismatch raises
`ValueError`.
5. **Install SFR + IPCQ**:
`kernbench.ccl.sfr_config.configure_sfr_intercube_multisip(engine,
spec, self._merged)`. This pushes IPCQ neighbor tables to every
SIP/cube's pe0 (one-time setup analogous to NCCL communicator
creation).
If the order changes (e.g., SFR runs before the algorithm module
loads), partial initialization can result. So D3 is treated as an
atomic five-step block: on failure the backend remains uninstalled.
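The grid-dimension derivation in the topology-metadata step can be sketched as follows for the non-`ring_1d` case. `derive_grid_dims` is a hypothetical helper, and falling back to the square root only when `w` or `h` is missing is an assumption about how "explicit `w`/`h` or via square-root" is applied.

```python
import math


def derive_grid_dims(n_sips, w=None, h=None):
    """Pick (w, h) for a 2-D SIP topology and insist that w * h == n_sips."""
    if w is None or h is None:
        # No explicit dims: assume a square grid.
        w = h = math.isqrt(n_sips)
    if w * h != n_sips:
        raise ValueError(
            f"SIP grid {w}x{h} does not cover n_sips={n_sips}")
    return w, h


square = derive_grid_dims(16)
explicit = derive_grid_dims(8, w=4, h=2)
try:
    derive_grid_dims(8)   # isqrt(8) == 2, and 2 * 2 != 8
    mismatch = False
except ValueError:
    mismatch = True
```

Using `math.isqrt` keeps the computation in integers, so a non-square SIP count is caught by the product check rather than hidden by float rounding.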
### D4. Greenlet-local rank binding (ADR-0024 D2)
`DistributedContext._rank_by_greenlet: dict[greenlet, int]` maps spawned
worker greenlets to their ranks. When the bench launcher (e.g.,
`torch.multiprocessing.spawn`) spawns a worker, it registers via
`dc._bind_rank(g, rank)`.
`get_rank()` looks up `getcurrent()`'s greenlet. Unregistered greenlets
fall back to 0 — preserves single-driver / test compatibility.
The backend reads the current greenlet's rank from
`_dist_ctx._rank_by_greenlet` during `all_reduce` (D5).
### D5. `all_reduce(tensor, op="sum")` behavior
Validation:
- `op != "sum"` → `NotImplementedError`. Current kernels only
implement add reduction.
- `tensor._handle is None` → `RuntimeError("not deployed")`.
- `tensor._handle.shards` empty → `RuntimeError("no shards")`.
Preparation:
- `n_elem = shards[0].nbytes // tensor.itemsize` — element count of a
single shard.
- `kernel_fn = self._algo_module.kernel` — the algorithm module's entry
function (imported in D3).
- Decide effective cube dims: if the first SIP has just 1 cube, use
`(1, 1)`; otherwise use the topology's `cube_w`/`cube_h`. This
naturally absorbs TP runs that use only a subset of cubes.
- `kernel_args = self._algo_module.kernel_args(world_size, n_elem,
cube_w, cube_h)` — the algorithm decides which arguments to pass to
its kernel.
Dispatch:
- Resolve the current greenlet's rank via
`_rank_by_greenlet.get(g, 0)`.
- Append `extra_args = (sip_rank, sip_topo_kind, sip_topo_w,
sip_topo_h)`.
- `pending = self.ctx.launch(algorithm_name, kernel_fn, tensor,
*kernel_args, *extra_args, _defer_wait=True)` — `_defer_wait=True`
delegates collective drain to the main scheduler (ADR-0027 D0.4).
Drain:
- If the parent greenlet is alive (multi-greenlet mode), enqueue
`_pending_collective_handles` and switch to parent. The main
scheduler drains after all ranks have launched.
- If single-driver mode, drain inline:
`for h, _sip_id, meta in pending: self.ctx.wait(h, _meta=meta)`.
### D6. `barrier()` is a no-op (single-driver model)
kernbench runs all ranks as greenlets inside a single Python process,
so no cross-process synchronization is needed. `barrier()` is callable
but does no synchronization. Kept for real-PyTorch API compatibility so
callers don't get `NotImplementedError`.
If multi-process kernbench (SimPy event loop per process) is introduced
in the future, D6 needs a superseding ADR.
### D7. Semantics of `get_rank` / `get_world_size` / `get_backend`
- `get_rank()` (D4): the current greenlet's bound rank; unregistered → 0.
- `get_world_size()` (D2): the world_size resolved by the backend in D3.
- `get_backend()`: always the literal string `"ahbm"`. Calling before
backend exists triggers `_ensure_initialized`'s RuntimeError.
Differences vs. real PyTorch:
- Real PyTorch `get_rank()` is a process-global value; here it is
greenlet-local. Inside a spawned worker → the worker's rank; in the
main thread → 0. Bench authors should expect meaningful ranks only
inside worker functions.
### D8. Supported API surface (final)
`DistributedContext` exposes:
- `init_process_group(backend="ahbm", world_size=None, rank=None,
**kwargs)`
- `is_initialized() -> bool`
- `get_world_size() -> int`
- `get_rank() -> int`
- `get_backend() -> str`
- `all_reduce(tensor, op="sum") -> None`
- `barrier() -> None`
- (internal) `_bind_rank(g, rank)`
Other PyTorch distributed APIs (`broadcast`, `reduce`, `all_gather`,
`gather`, `scatter`, point-to-point `send/recv`, etc.) are **not
implemented**. Kernel-level expression is available via
`tl.send`/`tl.recv` (ADR-0046 D3.10), but the `dist.*` surface does not
expose them. If additional collectives are needed, add an
(algorithm module, `DistributedContext` method) pair and extend D8.
## Alternatives Considered
### A1. Create the backend in `RuntimeContext.__init__`
Rejected. If `ccl.yaml` is missing or the algorithm module can't be
imported, RuntimeContext construction would fail even when the bench
does not use distributed features. Lazy creation at call time (D1) is
the right semantics.
### A2. Always derive world_size from topology (no override)
Rejected. ADR-0024 D1's "explicit override" path is used by legacy
tests. Diagnostic scenarios that define PE-level ranks within a single
SIP also need this escape hatch.
### A3. Silent fallback for unsupported `op`
Rejected. If the user intends `op="prod"` / `"max"` / `"avg"` and `sum`
silently runs instead, validating results becomes very hard. Explicit
`NotImplementedError` is safer.
### A4. Implement `barrier` as a SimPy event
Rejected (currently). With single-driver semantics there is no
cross-process synchronization to express, so a no-op is meaningfully
correct. A fake-barrier SimPy event would add code complexity for no
semantic gain. Revisit when multi-process kernbench arrives.
## Consequences
- The five-step installation (D3) for
`torch.distributed.init_process_group(backend="ahbm")` is locked in,
making clear where future collective algorithms must hook.
- The priority order in D2 (algorithm > defaults > topology) makes the
blast radius of a ccl.yaml change easy to assess.
- The no-op `barrier` (D6) is recorded so multi-process kernbench, if
introduced, must explicitly supersede this ADR.
- D8's list of unsupported APIs explicitly grounds the rejection
message when users call, e.g., `dist.broadcast(...)`.
@@ -0,0 +1,278 @@
# ADR-0048: Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
## Status
Accepted (2026-05-22).
Pins down the free-list algorithm, page alignment, and coalescing rules
used by `policy/address/allocator.py`'s `_FreeList` / `PEMemAllocator`
and `va_allocator.py`'s `VirtualAllocator`. ADR-0001 (PhysAddr layout)
and ADR-0011 (PA/VA/LA models) define the address schemes; the
**allocation algorithms** had no ADR-level coverage.
## First action
### `_FreeList(capacity)`
On construction: `self._capacity = capacity`, `self._used = 0`,
`self._free = [(0, capacity)]`. The first act is **establishing the
entire region as one free block** — the tuple `(offset=0,
size=capacity)` is the sole entry in the free list.
### `PEMemAllocator(sip_id, die_id, pe_id, cfg)`
On construction, builds two `_FreeList`s:
- `self._hbm = _FreeList(cfg.hbm_slice_bytes)` — the size of this PE's
HBM slice (`hbm_bytes_per_cube // hbm_slices_per_cube`).
- `self._tcm = _FreeList(cfg.tcm_allocatable_bytes)` — equals
`tcm_bytes_per_pe - tcm_scheduler_reserved_bytes` (the scheduler
reservation is pre-deducted).
So PEMemAllocator's first act is **constructing single-free-block
HBM-slice and TCM regions for this PE**.
### `VirtualAllocator(va_base, va_size, page_size=2*1024*1024)`
On construction: `self._va_base = va_base`, `self._va_size = va_size`,
`self._page_size = page_size`, `self._used = 0`, `self._free =
[(va_base, va_size)]`. The first act is **establishing one free block
that starts at va_base and spans va_size bytes, and stashing page_size**.
## Context
`runtime_api/context.py::_ensure_allocators` builds the allocator set
in these stages:
1. Read `hbm_total_gb_per_cube`, `hbm_slices_per_cube`, `tcm_size_mb`,
per-target_device SIP range, etc. from `spec`.
2. Pack everything into a frozen `AddressConfig`.
3. For every combination in the target SIP range × cubes × PEs,
construct one `PEMemAllocator(sip, cube, pe, cfg)` instance.
4. Construct one `VirtualAllocator(va_base=0x1_0000_0000, va_size=64
GiB, page_size=pe_mmu.page_size)`.
Allocator responsibilities:
- **PEMemAllocator**: PA-space allocation in the PE-local HBM slice /
TCM (including PhysAddr encoding).
- **VirtualAllocator**: device-wide VA allocation, page-aligned.
`RuntimeContext._create_tensor` then pushes VA → PA mappings to
components via `MmuMapMsg`.
These algorithms are:
- **First-fit**, kept simple.
- The free-block list is **sorted by start offset**.
- On `free()`, **adjacent blocks coalesce**.
The rationale was not documented anywhere, so questions such as "why
not best-fit?", "why not a buddy allocator?", and "why does
partial-overlap free pass silently?" had no anchor to answer from. This
ADR provides it.
## Decision
### D1. `_FreeList` — offset-keyed first-fit + coalescing
`policy/address/allocator.py::_FreeList`:
- Internal representation: `list[tuple[int, int]] = [(start_offset,
size), ...]` — sorted by start offset.
- `alloc(nbytes)`:
1. Iterate the free list from the front (first-fit).
2. Take from the first block with `size >= nbytes`.
3. Exact match → drop the block; otherwise shrink it to `(start +
nbytes, size - nbytes)`.
4. `_used += nbytes`; return the taken `start`.
5. If no block fits, `AllocationError("overflow ... largest free
block ...")`.
- `free(offset, nbytes)`:
1. `_used -= nbytes`.
2. `bisect_left(self._free, (offset,))` finds the insertion index.
3. If adjacent to the previous block (`prev_start + prev_size ==
offset`), merge.
4. If adjacent to the next block (`offset + nbytes == next_start`),
merge.
5. Insert the coalesced range at the right sorted position.
This algorithm is weaker than best-fit / buddy on fragmentation, but
the simulator's workload (mostly stack-like deploy/free) tolerates it.
If the workload shape changes, D1 is a supersession candidate.
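The alloc/free steps listed in D1 can be sketched as a compact class. This follows the documented algorithm (sorted list, first-fit, `bisect_left` lookup, merge with both neighbours) but is not the code in `allocator.py`; the class name `FreeList`, the exact error text, and the local `AllocationError` are illustrative.

```python
from bisect import bisect_left


class AllocationError(Exception):
    pass


class FreeList:
    def __init__(self, capacity):
        self._capacity = capacity
        self._used = 0
        self._free = [(0, capacity)]  # sorted by start offset

    def alloc(self, nbytes):
        for i, (start, size) in enumerate(self._free):  # first-fit scan
            if size >= nbytes:
                if size == nbytes:
                    del self._free[i]                   # exact match
                else:
                    self._free[i] = (start + nbytes, size - nbytes)
                self._used += nbytes
                return start
        largest = max((s for _, s in self._free), default=0)
        raise AllocationError(
            f"overflow: need {nbytes} bytes, largest free block {largest}")

    def free(self, offset, nbytes):
        self._used -= nbytes
        i = bisect_left(self._free, (offset,))
        start, size = offset, nbytes
        if i > 0:                                       # merge with previous
            p_start, p_size = self._free[i - 1]
            if p_start + p_size == start:
                start, size = p_start, p_size + size
                i -= 1
                del self._free[i]
        if i < len(self._free):                         # merge with next
            n_start, n_size = self._free[i]
            if start + size == n_start:
                size += n_size
                del self._free[i]
        self._free.insert(i, (start, size))


fl = FreeList(100)
a, b, c = fl.alloc(10), fl.alloc(20), fl.alloc(30)
fl.free(b, 20)
fl.free(a, 10)
fl.free(c, 30)
merged = list(fl._free)

frag = FreeList(100)
x, y, z = frag.alloc(40), frag.alloc(20), frag.alloc(40)
frag.free(x, 40)
frag.free(z, 40)
try:
    frag.alloc(50)   # 80 bytes free in total, but no single block >= 50
    fragmented = False
except AllocationError:
    fragmented = True
```

The second scenario shows the fragmentation weakness mentioned above: the request fails although total free space would suffice, and the error message reports the largest block so the two cases can be told apart.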
### D2. Partial-overlap free is **not** validated
`_FreeList.free(offset, nbytes)` trusts the caller to pass the exact
`(offset, nbytes)`. It does **not** verify:
- That the range was actually allocated.
- That the range does not overlap another allocated region.
Reason: in a simulator context, callers always store the return value
of `alloc()` and pass it back to `free()` — there is no external user
input. Adding a safety check would cost O(N) per free and impact
simulation wall-clock.
If this trust model breaks (e.g., a code path lets two tensors point
at the same PA), this ADR must be revisited.
### D3. `PEMemAllocator` — two channels for HBM/TCM
`PEMemAllocator(sip_id, die_id, pe_id, cfg)` holds two `_FreeList`s:
- `_hbm`: size `cfg.hbm_slice_bytes`.
- `_tcm`: size `cfg.tcm_allocatable_bytes` (= `tcm_bytes_per_pe -
tcm_scheduler_reserved_bytes`).
`alloc_hbm(nbytes) -> PhysAddr`:
- `_hbm.alloc(nbytes)` → offset.
- `PhysAddr.pe_hbm_addr(sip_id, die_id, pe_id,
pe_local_hbm_offset=offset, slice_size_bytes=cfg.hbm_slice_bytes)`.
- Failure raises `AllocationError("HBM overflow ...")`.
`free_hbm(pa, nbytes)`:
- Recover PE-local offset via `pa.hbm_offset - pe_id *
cfg.hbm_slice_bytes`.
- `_hbm.free(offset, nbytes)`.
`alloc_tcm(nbytes) -> PhysAddr`: similar; uses `PhysAddr.pe_tcm_addr`.
`free_tcm(pa, nbytes)`: uses `pa.sub_offset` directly (TCM's PE-local
offset equals its sub_offset).
The allocator does not see the scheduler-reserved TCM region
(`cfg.tcm_scheduler_reserved_bytes`) — it is pre-subtracted from the
`_tcm` capacity. This is consistent with ADR-0014's PE_SCHEDULER
internal-buffer reservation.
### D4. `VirtualAllocator` — page-aligned first-fit + coalescing
`policy/address/va_allocator.py::VirtualAllocator`:
- Internal representation: same sorted `list[tuple[int, int]]` as
`_FreeList`. Initially `[(va_base, va_size)]`.
- `_align_up(nbytes) = ceil(nbytes / page_size) * page_size`.
- `alloc(nbytes) -> int`:
1. `aligned = _align_up(nbytes)`.
2. First-fit a block with `size >= aligned`.
3. Take `aligned` from the block's front; remove if exact.
4. `_used += aligned`. Return the block's `start` (page-aligned as
long as `va_base` is).
5. Failure → `VaAllocationError`.
- `free(va, nbytes)`: free `_align_up(nbytes)` worth. Coalesces with
the same algorithm as `_FreeList`.
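The page-aligned allocation path can be sketched in isolation. `MiniVirtualAllocator` is a hypothetical reduction that models only `_align_up` and `alloc`; `free` and coalescing are omitted, and the small 4 KiB page and 8-page range are made-up values chosen to keep the numbers readable.

```python
class VaAllocationError(Exception):
    pass


class MiniVirtualAllocator:
    def __init__(self, va_base, va_size, page_size=2 * 1024 * 1024):
        self._page_size = page_size
        self._used = 0
        self._free = [(va_base, va_size)]

    def _align_up(self, nbytes):
        # Integer ceiling to a whole number of pages.
        return -(-nbytes // self._page_size) * self._page_size

    def alloc(self, nbytes):
        aligned = self._align_up(nbytes)
        for i, (start, size) in enumerate(self._free):  # first-fit
            if size >= aligned:
                if size == aligned:
                    del self._free[i]
                else:
                    self._free[i] = (start + aligned, size - aligned)
                self._used += aligned
                return start
        raise VaAllocationError(
            f"need {aligned} bytes ({nbytes} requested), no block fits")


PAGE = 4096
BASE = 0x1_0000_0000
va = MiniVirtualAllocator(BASE, 8 * PAGE, page_size=PAGE)
first = va.alloc(1)           # rounds up to one page
second = va.alloc(PAGE + 1)   # rounds up to two pages
try:
    va.alloc(6 * PAGE)        # only five pages remain
    exhausted = False
except VaAllocationError:
    exhausted = True
```

Because every allocation consumes a whole number of pages, each returned address stays a page multiple away from `va_base`.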
`page_size` has different defaults in two places:
- `VirtualAllocator.__init__`'s parameter default: `2 MiB`. Direct-call
tests receive this.
- `RuntimeContext._ensure_allocators` when constructing the instance:
`pe_mmu.attrs.get("page_size", 4096)` — uses
`topology.yaml`'s `pe_mmu.attrs.page_size` if set, else falls back
to `4 KiB`.
The two defaults differ on purpose: `VirtualAllocator`'s standalone
default (`2 MiB`) aligns with ADR-0039's PE_MMU stopgap default for
direct-test ergonomics; the context fallback (`4 KiB`) is the safe
minimum when `topology.yaml` doesn't specify a page size. The
production path is always the latter (via `_ensure_allocators`), and
when `topology.yaml` sets `page_size`, that value flows consistently
into both the MMU and the VA allocator.
If consistency breaks (e.g., VirtualAllocator instantiated with a
page_size different from PE_MMU's), MMU `map()` falls into the
sub-page region mode (ADR-0039 D3).
VA range defaults: `va_base = 0x1_0000_0000` (= 4 GiB), `va_size = 64
GiB`. These are hardcoded in `_ensure_allocators` and have no
semantic meaning in ADR-0011's VA model — they simply reserve enough
device-wide space without colliding with host code.
### D5. Lifecycle of allocator instances
- `RuntimeContext._ensure_allocators` is lazy — called on the first
`_create_tensor`.
- The allocator dict (`self._allocators`) lives for the
RuntimeContext's lifetime. A second deploy through the same
RuntimeContext reuses the existing allocator objects.
- `RuntimeContext.cleanup()` walks living tensors and calls
`_free_tensor()`, which issues MMU unmaps + `va_allocator.free` +
`pemem_allocator.free_hbm` — restoring the free lists. A subsequent
RuntimeContext starts fresh.
This per-RuntimeContext isolation guarantees deterministic deploy →
cleanup → deploy sequences within a single process.
### D6. Allocator failure raises (no silent OOM)
Both `_FreeList.alloc` and `VirtualAllocator.alloc` raise
`AllocationError` / `VaAllocationError` when no block fits. The message
includes "required size + largest available block" to distinguish
fragmentation from true OOM.
A silent fallback (e.g., allocating only as much as the largest free
block) is strictly forbidden — a partially-allocated tensor reaching
SimPy would cause routing / DMA to see incorrect PAs and silently
corrupt simulation results.
### D7. One allocator per address space
Physical address spaces are separated by PhysAddr sub-units (ADR-0001
D2.3); each sub-unit gets its own allocator instance:
- HBM slice → `PEMemAllocator._hbm`.
- PE TCM → `PEMemAllocator._tcm`.
- (Currently unused) M_CPU local memory, CUBE SRAM → would need their
own allocators. Today these are handled as IPCQ-only slots (ADR-0023
D9.7) and do not share PA space, so no free-list exists for them.
When a cube-level SRAM allocator is needed,
`_FreeList(cfg.sram_bytes_per_cube)` is added per-cube
(`cfg.sram_bytes_per_cube` is already defined in `AddressConfig` —
the data model is ready).
## Alternatives Considered
### A1. Best-fit / buddy allocator
Rejected (currently). The workload's alloc/free pattern is stack-like
(deploy order ≈ free order), so first-fit + coalescing controls
fragmentation well enough. If long-running fragmentation appears in LLM
kernel sweeps, a buddy-allocator ADR will replace D1.
### A2. Add partial-overlap free validation
Rejected. D2's trust model plus the O(N) per-free cost makes this
unattractive. A debug mode (e.g., `KERNBENCH_DEBUG` env var) that
enables the check could be added later.
### A3. A unified allocator for VA and PA
Rejected. VA space (64 GiB device-wide) and PA space (per-slice ~6
GiB) have different semantic dimensions — VA is the kernel's view, PA
is the device sub-unit's view. ADR-0011's VA model (MMU maps between
the two) calls for separated allocators.
### A4. Multi-tier page sizes (large pages + small pages)
Rejected (currently). A single page size (2 MiB) matches LLM kernel
tensor sizes (a few MiB to GiB); smaller mappings are absorbed by
ADR-0039 D3's sub-page region mode. Multi-tier paging would require
extending the MMU model itself — a separate ADR candidate.
## Consequences
- The allocator algorithm is pinned at ADR level (D1, D3, D4), so any
future simulation scenario hitting fragmentation has a clear "we're
using first-fit + coalescing" anchor to inspect.
- D2's trust model is explicit, so any future code path that exposes
alloc/free to direct user input will trigger this ADR's supersession
early.
- D7's one-allocator-per-sub-unit mapping is on record, so when M_CPU
or SRAM need their own free-list, the addition point is obvious.
- D4 captures the page_size dual-default and its production path
(`_ensure_allocators` always wins), letting future `topology.yaml`
`page_size` changes be assessed against ADR-0039's stopgap
interaction quickly.
@@ -0,0 +1,247 @@
# ADR-0049: `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
## Status
Accepted (2026-05-22).
Pins down the traffic-pattern catalog, formula-vs-actual comparison, and
invariant checks (monotonicity, D2H ≥ H2D, etc.) exposed by
`probes/probe.py::run_probe(...)`. ADR-0010 (CLI surface) enumerates the
`kernbench probe` subcommand, but **what probe actually measures** and
**which invariants it judges PASS/FAIL** had no ADR-level coverage.
## First action
`run_probe(topology_path, case_filter=None)` performs four startup steps:
1. `Path(topology_path).expanduser().resolve()` → absolute path.
2. `load_topology(path)` → `TopologyGraph` (graph + spec).
3. `_build_edge_map(graph)` → a `{(src, dst): Edge}` lookup table.
4. Instantiate `AddressResolver(graph)` + `PathRouter(graph)`.
Then it sets `nbytes = 32768` (= 32 KiB, the summary-table reference
size) and `show_all = (case_filter is None or case_filter == "all")`.
In short, **probe's first act is "load the topology once and prepare
edge map / resolver / router, plus pin 32 KiB as the standard measurement
size"**. After that, the H2D → D2H → PE DMA categories execute in
separate `GraphEngine` instances (no cross-talk between cases).
## Context
`kernbench probe` was introduced as a verification tool for these
purposes:
- **Manual ground truth**: when a real-simulation result (`kernbench run
--bench ...`) shows abnormal latency, derive the answer for a simple
traffic pattern in isolation and compare.
- **Formula vs actual**: check whether the analytical model
(wire latency + overhead + drain) matches the simulator's
`total_ns`. A mismatch points to which simplifying assumption in
ADR-0033 is missing.
- **Monotonicity check**: latency should grow monotonically with hop
count.
- **Utilization sweep**: a BW-utilization table across data sizes
(4 KiB to 1 MiB).
Without an ADR for this tool:
- Adding a new traffic-pattern category (e.g., MCpuDma, IPCQ) is hard
because the table format / measurement units of existing categories
aren't documented at the ADR level.
- The basis for the monotonicity check (hop count? cube distance? wire
length?) is ambiguous.
- The reference size 32 KiB and the sweep `[4 KiB, 16 KiB, 64 KiB, 256
KiB, 1 MiB]` are only discoverable by reading source.
## Decision
### D1. Three case categories — H2D / D2H / PE DMA
Each category has a distinct data path in the topology and gets its own
summary table + sweep table + route-detail block.
- **H2D (Host → Device Write)**: `MemoryWriteMsg(dst_sip=0, dst_cube,
  dst_pe=0, pattern="zero")` flows along `pcie_ep → io_cpu → m_cpu →
  hbm_ctrl`. The cube index varies the hop count:
  - h2d-1hop: cube=0, hops=1
  - h2d-2hop: cube=4, hops=2
  - h2d-3hop: cube=8, hops=3
  - h2d-4hop: cube=12, hops=4
- **D2H (Device → Host Read)**: `MemoryReadMsg(src_sip=0, src_cube,
  src_pe=0)`. Total latency = forward command path + reverse data path.
  It covers the same four hop-count cases as H2D.
- **PE DMA (PE-initiated)**: `PeDmaMsg(src_sip, src_cube, src_pe,
  dst_pa)`. Five cases cover varying cube/PE positions:
  - pe-local-hbm: same cube, same PE
  - pe-same-half-hbm: same cube, different PE (PE 1)
  - pe-cross-half-hbm: same cube, far PE (PE 4)
  - pe-cross-cube-hbm-best: adjacent cube (cube 1)
  - pe-cross-cube-hbm-worst: diagonal far cube (cube 15)
The cube indices 4/8/12 (H2D) and 1/4/15 (PE DMA) are meaningful for a
4 × 4 cube mesh (`sip.cube_mesh.w=4, h=4`); changes to the mesh size
require these to be updated in lockstep.
### D2. Standard measurement size — `nbytes = 32768` (32 KiB)
Every case in the summary table runs once with `nbytes=32768`. 32 KiB
was chosen because:
- DMA overhead and BW drain are balanced — neither dominates.
- It compares cleanly against the one-shot transfer size of several
sub-units (TCM, register file).
Per-size utilization variations are shown in a separate sweep table
(D3).
### D3. Utilization sweep — `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]`
`SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]`,
`SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]`. Per size:
```
drain = nbytes / bottleneck_bw
total = overhead + wire + drain
eff_bw = nbytes / total
util% = eff_bw / bottleneck_bw × 100
```
When `bn_bw is None or <= 0`, the column shows 0.0 %. The intent: the
table shows in one view how small transfers become overhead-bound and
large transfers become drain-bound as hop count rises.
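The sweep formula can be sketched as follows. This is an illustrative reimplementation, not probe's code: `util_pct` and `sweep_table` are hypothetical names, and the sketch assumes bandwidth is expressed in GB/s with 1 GB/s equal to 1 byte/ns, so all times are in nanoseconds.

```python
SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]
SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]


def util_pct(nbytes, overhead_ns, wire_ns, bn_bw):
    """Single-stream utilization (%) for one transfer size.

    Assumption: bn_bw is in GB/s, i.e. bytes per nanosecond.
    """
    if bn_bw is None or bn_bw <= 0:
        return 0.0  # missing bottleneck BW is shown as 0.0 %
    drain = nbytes / bn_bw
    total = overhead_ns + wire_ns + drain
    eff_bw = nbytes / total
    return eff_bw / bn_bw * 100.0


def sweep_table(overhead_ns, wire_ns, bn_bw):
    """Map each sweep label to its utilization percentage."""
    return {
        label: util_pct(size, overhead_ns, wire_ns, bn_bw)
        for label, size in zip(SWEEP_LABELS, SWEEP_SIZES)
    }
```

With a made-up 100 ns of overhead and a 64 GB/s bottleneck, the 4 KiB row lands near 39 % while the 1 MiB row exceeds 99 %, which is the overhead-bound to drain-bound transition the table is meant to expose.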
### D4. Measured columns — actual / formula / breakdown
Per-case columns:
- `Actual` (total_ns): the SimPy run's `trace["total_ns"]`.
- `Ovhd`: sum of `node.attrs["overhead_ns"]` along the path (formula).
- `Drain`: `nbytes / min(edge.bw_gbs over path)` (formula).
- `Wire`: `Σ edge.distance_mm * (ns_per_mm from spec)`.
- `Ovhd%` / `Drain%`: each portion as a percentage of Actual. Wire is
usually too small to display.
- `Eff.BW`: `nbytes / total_ns` (measured BW).
- `BN.BW`: bottleneck bandwidth (formula). The minimum edge BW along
the path. Missing edge BW shows "-".
- `Util%`: `Eff.BW / BN.BW × 100`. 100 % means the single-stream BW
upper bound is reached.
A large gap between the formula sum (`wire + ovhd + drain`) and Actual
signals a factor the simplified model misses (a place to inspect
ADR-0033's assumptions).
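A rough sketch of how the formula columns relate to a path, under stated assumptions: `EdgeSketch`, `formula_columns`, and `util_vs_actual` are hypothetical names, bandwidth is taken as bytes per nanosecond, and treating a path with no declared edge BW as zero drain is this sketch's choice rather than documented probe behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EdgeSketch:
    distance_mm: float
    bw_gbs: Optional[float]  # None when the edge declares no bandwidth


def formula_columns(node_overheads_ns, edges, nbytes, ns_per_mm):
    """Compute the formula-side columns (Ovhd, Wire, Drain, BN.BW)."""
    ovhd = sum(node_overheads_ns)
    wire = sum(e.distance_mm for e in edges) * ns_per_mm
    known = [e.bw_gbs for e in edges if e.bw_gbs]
    bn_bw = min(known) if known else None  # bottleneck = minimum edge BW
    drain = nbytes / bn_bw if bn_bw else 0.0
    return {"Ovhd": ovhd, "Wire": wire, "Drain": drain, "BN.BW": bn_bw}


def util_vs_actual(nbytes, actual_total_ns, bn_bw):
    """Util% = Eff.BW / BN.BW * 100, with Eff.BW taken from the measured run."""
    eff_bw = nbytes / actual_total_ns
    return eff_bw / bn_bw * 100.0 if bn_bw else 0.0
```

Comparing `Ovhd + Wire + Drain` against the measured `total_ns` is then a one-line subtraction; the size of that gap is what points back at ADR-0033.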
### D5. Automatic invariant checks — PASS/FAIL
The following invariants are reported with `[v] PASS` / `[x] FAIL`:
- **H2D / D2H monotonic increase**: as hop count rises, actual latency
must grow monotonically. `all(lats[i] < lats[i+1] for ...)`.
- **D2H ≥ H2D**: for the same hop index, D2H ≥ H2D (D2H has both
forward command and reverse data legs). `all(d2h[i].total >=
h2d[i].total)`.
- **PE DMA best < worst**: cross-cube best (adjacent) latency must be
less than cross-cube worst (diagonal).
- **PE DMA local vs remote**: prints the local BN BW vs remote BN BW
side-by-side (informational, not PASS/FAIL).
When a check fails, a single clear line surfaces the regression for
human review.
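The invariants above reduce to a few comparisons. The sketch below is illustrative only: the function names are hypothetical, and the text after the `[v]` / `[x]` prefix is an assumed layout, since D9 leaves everything but the prefix unspecified.

```python
def strictly_increasing(lats):
    """True when each latency is strictly below the next one."""
    return all(lats[i] < lats[i + 1] for i in range(len(lats) - 1))


def d2h_at_least_h2d(h2d_totals, d2h_totals):
    """True when every D2H total is >= the H2D total at the same hop index."""
    return all(d >= h for h, d in zip(h2d_totals, d2h_totals))


def verdict(ok, label):
    """Format one result line with the stable [v] / [x] prefix."""
    return f"{'[v] PASS' if ok else '[x] FAIL'}: {label}"
```

Note that a list with a single data point passes `strictly_increasing` vacuously, so a caller that wants to skip the check for filtered runs has to do so before calling it.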
### D6. Route detail — per-hop timestamp trace
After the summary and sweep tables, each case's path and cumulative
per-hop timestamps (`_hop_timestamps`) appear in a separate section:
- H2D: leg1 (`pcie_ep → io_cpu`) + leg2 (`io_cpu → m_cpu`) + leg3
(`m_cpu → hbm_ctrl`) + per-hop trace.
- D2H: forward (cmd, no data) and reverse (data) traces shown
separately.
- PE DMA: `pe_dma → router → hbm_ctrl` path + per-hop trace.
Each hop's timestamp is cumulative `wire_ns + overhead_ns`. The
terminal hop's annotation appends `drain:Xns`. Bottleneck edges are
marked `<BN:XXGB/s>` so they are visually identifiable.
### D7. Semantics of the `case_filter` argument
- `None` or `"all"`: run all cases (default).
- Other strings: run only the case whose name matches exactly. Example:
`kernbench probe --case h2d-2hop`.
Within a category, cases with `name != case_filter` are skipped; if
only one data point remains, the category's monotonicity / D2H ≥ H2D
comparisons are naturally skipped.
The CLI parser's `--case` default is `"all"`, so omitting it runs
everything.
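The filter semantics can be captured in a few lines. `select_cases` is a hypothetical helper written for this illustration; probe itself may apply the filter inline per category.

```python
def select_cases(case_names, case_filter=None):
    """Return the cases to run: all of them, or the single exact match."""
    if case_filter is None or case_filter == "all":
        return list(case_names)
    # Exact string match only; no prefix or glob matching.
    return [name for name in case_names if name == case_filter]
```

Because matching is exact, a filter such as `"h2d"` selects nothing, whereas `"h2d-2hop"` selects exactly one case.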
### D8. Fresh GraphEngine per case
Each of the 4 H2D, 4 D2H, and 5 PE DMA cases runs in **its own
GraphEngine** (`engine = GraphEngine(graph)`). Reasons:
- Isolate accumulated state (op_log, completion tracking, allocators)
so cases do not cross-talk.
- Guarantee one case's traffic does not perturb another case's BW
measurement.
This isolation lets probe results be interpreted as **single-flow**
per-case latency. Multi-flow contention measurement is handled by
separate tooling (e.g., the `pe2pe_overview` plot or ADR-0033's
multi-flow merging model).
### D9. Output-format stability
probe's stdout is meant for humans; precise column widths, separators,
and whitespace are **not** a machine-readable contract. Automated tools
that wish to parse probe output should use a separate JSON-output mode
(not yet implemented).
The `[v]` / `[x]` prefix on PASS/FAIL lines is a stable CI grep anchor.
## Alternatives Considered
### A1. Register probe as another bench (`@bench(name="probe")`)
Rejected. probe is a verification tool, not a bench: multi-engine
execution for sweeps/analysis and PASS/FAIL invariant output are both
essential, and neither fits ADR-0045's "single device + single
RuntimeContext" bench model.
### A2. Exit code 1 on monotonicity violation
Rejected (currently). probe is positioned as a human inspection tool —
PASS/FAIL is printed and exit is 0. A wrapper can `grep "\[x\]"` to
decide. A future `--strict` flag could opt into non-zero exits.
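A wrapper along those lines could look like the sketch below. It is an assumption-laden illustration: `exit_code_from_probe_output` is a hypothetical name, and the regex assumes the `[x]` anchor appears at the start of a line, optionally after whitespace.

```python
import re

# The only stable contract (D9) is the "[x]" prefix on FAIL lines.
FAIL_ANCHOR = re.compile(r"^\s*\[x\]", re.MULTILINE)


def exit_code_from_probe_output(stdout_text: str) -> int:
    """Return 1 if any FAIL line is present in probe's stdout, else 0."""
    return 1 if FAIL_ANCHOR.search(stdout_text) else 0
```

A CI step would capture the stdout of `kernbench probe` and pass it to this function, then exit with the returned value.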
### A3. Externalize the case catalog to YAML
Rejected (currently). The 13 cases (4 H2D + 4 D2H + 5 PE DMA)
are hardcoded and their semantics are tightly bound to the mesh
topology. Moving cube-index meaning (4, 8, 12 / 1, 4, 15) into YAML
would require separate documentation and lose cohesion. Externalize
only when case additions become frequent.
### A4. Add multi-flow contention measurement
Rejected (out of probe scope). D8's single-flow isolation is probe's
core intent. Multi-flow contention belongs in a different area of the
ADR-0033 latency model — either a separate tool or a new case
category.
## Consequences
- probe's case catalog (D1) and measurement units (D2/D3) are pinned at
ADR level, so new traffic categories know which table format to
follow.
- The semantics of the formula-vs-actual columns (D4) are locked in, so
questions like "why is Drain% 5 % or 70 %?" can quickly be linked to
ADR-0033 assumption checks.
- Automatic invariant checks (D5) are pinned, so latency-model changes
immediately catch monotonicity / D2H ≥ H2D regressions.
- D8's case-isolation is explicit, so probe results are safe to read as
single-flow measurements. If multi-flow is needed, a separate tool
track is clearly required.
- A2's strict-mode flag is recorded as a follow-up so CI integration
has a minimal change path when requested.
@@ -0,0 +1,322 @@
# ADR-0050: CCL Algorithm Module Contract — `ccl/algorithms/*.py`
## Status
Accepted (2026-05-22).
Pins down the interface, kernel signature, and addition workflow that a
module under `src/kernbench/ccl/algorithms/` must satisfy in order to be
used as a collective algorithm by the AHBM CCL backend (ADR-0047).
ADR-0047 D3 states only that "the algorithm module must expose `kernel`,
`kernel_args`, optionally `TOPO_NAME_TO_KIND`"; **the contract an
algorithm-module author needs to follow** has had no ADR-level coverage.
This ADR pairs with ADR-0045's bench-module contract.
## First action
An algorithm module is imported at two moments:
1. **AHBM backend entry**: when user code calls
   `dist.init_process_group(backend="ahbm")`,
   `AhbmCCLBackend.__init__` runs
   `self._algo_module = importlib.import_module(self._merged["module"])`.
   At module level, the following occur first:
   - Topology-kind integer constants like `SIP_TOPO_RING/TORUS/MESH`
     are bound in the module namespace.
   - The `TOPO_NAME_TO_KIND` dict is bound; the backend reads it via
     `getattr(self._algo_module, "TOPO_NAME_TO_KIND", None)`.
   - The `kernel_args` function is defined for the caller.
   - The actual algorithm function (e.g.,
     `allreduce_intercube_multidevice`) is defined.
   - At the bottom of the module, `kernel = allreduce_intercube_multidevice`
     publishes the alias.
2. **ccl.yaml install stage**:
   `kernbench.ccl.install.install_ipcq` imports the same algorithm
   module while pushing the IPCQ neighbor table.
In short, **the algorithm module's first act is "publish topology-kind
constants, the `TOPO_NAME_TO_KIND` dict, the `kernel_args` function, and
the `kernel` alias into the module namespace"** — all as import-time
side effects, no separate initialization call.
## Context
`AhbmCCLBackend` (ADR-0047), at process-group creation, dynamically
imports a module path obtained from `ccl.yaml`'s `defaults.algorithm` (or
a user-specified algorithm). The backend expects four things from the
module:
- `kernel`: the collective's entry function.
- `kernel_args(world_size, n_elem, cube_w=, cube_h=) -> tuple`: a tuple
packing the kernel's positional arguments.
- `TOPO_NAME_TO_KIND` (optional): a dict mapping `topology.yaml`'s
`sips.topology` string (e.g., `"ring_1d"`, `"torus_2d"`,
`"mesh_2d_no_wrap"`) to the integer kind constants.
- (Indirectly) IPCQ neighbor-table install:
`configure_sfr_intercube_multisip` reads
the module's `TOPO_NAME_TO_KIND` plus cube dimensions to decide the
SFR.
The current corpus has one algorithm module:
`lrab_hierarchical_allreduce.py` (248 lines). The name expands to
"**l**eft-**r**ight **a**lternating **b**roadcast hierarchical allreduce".
When future modules like `ring_allreduce`, `tree_allreduce`, or
`broadcast` are added, they must follow this contract for the backend's
dispatch path to keep working.
Without an ADR-level contract:
- A new algorithm author has to infer the signature from ADR-0047 D3's
one-liner.
- The kernel-function argument order (especially `t_ptr, n_elem,
cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w,
sip_topo_h, tl`) is unclear without grep.
- It is conventional, but not documented, what `kernel_args` takes as
inputs and what tuple it must return.
## Decision
### D1. The algorithm module exposes four public symbols
```python
# src/kernbench/ccl/algorithms/<name>.py
from __future__ import annotations

# (required) topology-kind constants, referenced internally
SIP_TOPO_RING = 0
SIP_TOPO_TORUS = 1
SIP_TOPO_MESH = 2

# (optional) topology name → kind mapping. Used by the backend to
# translate ccl.yaml/topology's string SIP topology into an integer.
TOPO_NAME_TO_KIND = {
    "ring_1d": SIP_TOPO_RING,
    "torus_2d": SIP_TOPO_TORUS,
    "mesh_2d_no_wrap": SIP_TOPO_MESH,
}


# (required) kernel argument builder
def kernel_args(world_size: int, n_elem: int, *, cube_w: int = 4, cube_h: int = 4) -> tuple:
    return (n_elem, cube_w, cube_h, world_size)


# (required) kernel function (TLContext is injected via the `tl=...`
# keyword argument).
def my_allreduce_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
                        sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
    ...


# (required) kernel alias: the backend accesses `module.kernel`
kernel = my_allreduce_kernel
```
- The `kernel` alias is the entry point the backend invokes. Whatever
the function name is (e.g., `allreduce_intercube_multidevice`), it
must be exposed via `module.kernel = fn`.
- Without `kernel_args`, the backend has no way to build the
algorithm's argument list. See D2 for the signature.
- If `TOPO_NAME_TO_KIND` is absent, the backend falls back to
`sip_topo_kind = 0`. An algorithm supporting only a single topology
may omit it.
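The fallback described in the last bullet can be sketched as below. `lookup_sip_topo_kind` is a hypothetical helper, not the backend's actual method, and letting an unknown topology name raise `KeyError` is an assumption of this sketch; the ADR only specifies the behavior when the dict is absent.

```python
from types import SimpleNamespace


def lookup_sip_topo_kind(algo_module, topology_name):
    """Translate a topology string into the module's integer kind."""
    table = getattr(algo_module, "TOPO_NAME_TO_KIND", None)
    if table is None:
        return 0  # documented fallback when the module omits the dict
    # Assumption: a name missing from the table is a configuration error.
    return table[topology_name]


# Stand-ins for imported algorithm modules, for illustration only.
multi_topo = SimpleNamespace(TOPO_NAME_TO_KIND={"ring_1d": 0, "torus_2d": 1})
single_topo = SimpleNamespace()
```

Here `lookup_sip_topo_kind(single_topo, "torus_2d")` yields the fallback kind, while the same call on `multi_topo` yields the mapped value.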
### D2. `kernel_args` signature — `(world_size, n_elem, *, cube_w, cube_h)`
```python
def kernel_args(world_size: int, n_elem: int, *,
                cube_w: int = 4, cube_h: int = 4) -> tuple:
    return (n_elem, cube_w, cube_h, world_size)
```
- **Positional arguments**: `world_size` (= number of ranks), `n_elem`
(= element count of a single shard, f16-based).
- **Keyword arguments**: `cube_w`, `cube_h` (= cube-mesh dimensions).
Default 4×4 — aligned with `topology.yaml`'s `sip.cube_mesh` default.
- **Return**: a tuple in the order the kernel's positional arguments
expect.
When the backend calls `all_reduce`:
```python
kernel_args_tuple = self._algo_module.kernel_args(
    self._world_size, n_elem, cube_w=eff_cube_w, cube_h=eff_cube_h,
)
extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)
pending = self.ctx.launch(
    self._merged["algorithm"], kernel_fn, tensor,
    *kernel_args_tuple, *extra_args, _defer_wait=True,
)
```
So the kernel's full positional argument list becomes: `(tensor_ptr,
*kernel_args_tuple, sip_rank, sip_topo_kind, sip_topo_w,
sip_topo_h)`, with `tl=...` injected as a keyword. The tuple length
and order returned by `kernel_args` must **match the kernel signature
1:1**.
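One way an author might check the 1:1 match before running a simulation is to bind the assembled argument list against the kernel's signature. This is a sketch, not part of the backend: `check_contract` is a hypothetical helper, and the placeholder values stand in for the tensor pointer, the four `extra_args`, and the injected `tl`.

```python
import inspect


def kernel_args(world_size, n_elem, *, cube_w=4, cube_h=4):
    return (n_elem, cube_w, cube_h, world_size)


def my_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
              sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
    ...


def check_contract(kernel, args_tuple):
    """Raise TypeError if (tensor, *args_tuple, *extra_args, tl=...) cannot bind."""
    extra_args = (0, 0, 0, 0)  # placeholders for sip_rank, kind, w, h
    signature = inspect.signature(kernel)
    signature.bind(object(), *args_tuple, *extra_args, tl=object())
```

Calling `check_contract(my_kernel, kernel_args(4, 8))` succeeds silently, whereas a `kernel_args` that returns too few or too many values fails with a `TypeError` at test time instead of inside a launch.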
### D3. Kernel signature — standardized 9 + tl arguments
Recommended signature:
```python
def my_kernel(
    t_ptr: int,          # VA base of the row-wise-sharded tensor on this SIP
    n_elem: int,         # element count per cube tile (or per shard)
    cube_w: int,         # cube mesh width (from kernel_args)
    cube_h: int,         # cube mesh height (from kernel_args)
    n_sips: int,         # equal to world_size (rank = SIP, ADR-0024)
    sip_rank: int,       # this SIP's rank
    sip_topo_kind: int,  # result of TOPO_NAME_TO_KIND lookup
    sip_topo_w: int,     # SIP mesh width (0 for ring_1d)
    sip_topo_h: int,     # SIP mesh height (0 for ring_1d)
    *, tl,               # TLContext (auto-injected)
) -> None:
    ...
```
Even if `kernel_args` chose a different positional argument order, the
kernel's **last four positional arguments are always
`(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`** — the backend
appends them as `extra_args` (ADR-0047 D5). A custom algorithm must
accept these four, but a single-SIP algorithm may simply ignore them.
`tl` is injected via keyword — `RuntimeContext.launch` adds `tl=tl_ctx`
just before invoking the kernel. The signature therefore exposes `tl`
as keyword-only (`*, tl`) or as the trailing keyword parameter.
### D4. Kernel body — freedom and constraints
Available inside the kernel: every `tl.*` primitive from ADR-0046 D3.
Common patterns:
- `cube_id = tl.program_id(axis=1)` — this PE's cube index.
- `pe_addr = t_ptr + cube_id * nbytes` — per-cube VA of the tile.
- `acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")` — load local
data.
- `tl.send(dir=...)` / `tl.recv(dir=..., shape=, dtype=)` — IPCQ
collective.
- `acc = acc + recv` — TensorHandle arithmetic operators (ADR-0046 D4).
- `tl.store(pe_addr, acc)` — store the result.
The kernel body is plain Python — branching and loops are fine. But:
- No SimPy `yield` or `async` (ADR-0046 D1).
- No direct access to TensorHandle `.data` — the Phase 1 timing model
doesn't see data dependencies (ADR-0020's 2-pass separation).
- Kernel execution must be deterministic — the same input must produce
the same op sequence. No random or external IO.
### D5. SIP topology semantics — meaning of `sip_topo_kind`
The backend looks up `topology.yaml`'s `system.sips.topology` string
in the algorithm module's `TOPO_NAME_TO_KIND` and passes the integer
as `sip_topo_kind`. The algorithm then branches:
```python
if sip_topo_kind == SIP_TOPO_RING:
    acc = _inter_sip_ring(...)
elif sip_topo_kind == SIP_TOPO_TORUS:
    acc = _inter_sip_torus_2d(...)
elif sip_topo_kind == SIP_TOPO_MESH:
    acc = _inter_sip_mesh_2d(...)
```
Each topology branch communicates with peers via IPCQ direction names
(`"global_E"`, `"W"`, `"S"`, `"N"` …). Direction semantics are defined
in ADR-0023/0025; `configure_sfr_intercube_multisip` installs the IPCQ
neighbor table accordingly.
If a topology kind not supported by the algorithm appears, prefer an
explicit `raise ValueError(f"unsupported topology kind
{sip_topo_kind}")` over a silent no-op — fail fast on misconfiguration.
### D6. The `ccl.yaml` algorithm entry
The algorithm module is paired with a `ccl.yaml` entry (ADR-0023 D10 +
ADR-0047 D3):
```yaml
defaults:
  algorithm: lrab_hierarchical_allreduce
  n_elem: 8
algorithms:
  lrab_hierarchical_allreduce:
    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
    # optional: world_size override
    # optional: per-algorithm parameters consumed by configure_sfr_intercube_multisip
```
- `module`: the full Python module path; `importlib.import_module`
consumes this string as-is.
- `world_size` (optional): when set, overrides the topology fallback
(ADR-0047 D2).
- Algorithm-specific parameters are consumed by
`configure_sfr_intercube_multisip`.
Workflow to add a new algorithm:
1. Write `src/kernbench/ccl/algorithms/<name>.py` following D1.
2. Add the entry under `algorithms` in `ccl.yaml`.
3. (If needed) extend `kernbench.ccl.sfr_config` with the SFR-install
branch.
4. Add tests (e.g., `tests/sccl/test_<name>.py`, extending the
ADR-0043 eval harness).
### D7. Legacy "rank = flat PE index" mode
The `world_size` override in `ccl.yaml`, surfaced by ADR-0047 D2, is
used by legacy "rank = flat PE index" tests. The algorithm module can
assume `n_sips=world_size` ranks even in this mode — the backend
maintains the rank↔(SIP, cube, PE) mapping, so no modal branching is
needed inside the algorithm body.
In single-cube workloads (where `cube_w=cube_h=1`), the algorithm must
skip mesh-based phases — see the
`single_cube = (cube_w == 1 and cube_h == 1)` pattern in
`lrab_hierarchical_allreduce.py`.
## Alternatives Considered
### A1. Organize the algorithm module as a class (`class Allreduce: kernel(...)`)
Rejected. The Python module namespace already identifies an algorithm
(see ADR-0047 D3's `importlib.import_module`). A class wrapper adds
indirection without simplifying dispatch. Module-level free functions
plus a `kernel` alias are clean and obvious.
### A2. Type `kernel_args` with an explicit dataclass
Rejected (currently). Each algorithm normally has a different argument
count; forcing one dataclass would hurt cross-algorithm interchange.
The tuple return is simple and unpacks cleanly with the backend's
`*kernel_args_tuple`. If an algorithm wants stronger internal typing,
it may define its own NamedTuple.
### A3. Move SFR installation inside the algorithm module
Rejected. SFR installation
(`configure_sfr_intercube_multisip`) is a cross-module decision
combining topology + algorithm; `kernbench.ccl.sfr_config` is a more
natural home than the algorithm module itself. D6's "extend
sfr_config if needed" workflow keeps responsibility boundaries clear.
### A4. Auto-register algorithm names via a decorator (analogous to ADR-0045's `@bench`)
Rejected. Unlike benches, algorithms are already tied to `ccl.yaml`
entries; an additional registry would be redundant. The string mapping
in `module` is sufficient.
## Consequences
- ADR-0047 D3's one-line contract expands to a D1–D7 author-facing
guide; new algorithm signatures no longer need to be grep-derived.
- D3's standardized 9 + tl signature couples naturally with the
backend's `extra_args` append (ADR-0047 D5). It is explicit that
even single-SIP-only algorithms must accept the four `sip_*` trailing
arguments.
- D5's fail-loud recommendation means a `ccl.yaml` topology that the
algorithm doesn't support will surface as an explicit `ValueError`
rather than a silent wrong result.
- D6's step-by-step addition workflow makes clear how far a new
algorithm has to reach into sfr_config / tests / ccl.yaml.
@@ -0,0 +1,288 @@
# ADR-0051: Routing Helper API — `AddressResolver` + `PathRouter`
## Status
Accepted (2026-05-22).
Pins down every public API, argument, return value, and adjacency-graph
selection of the two helper classes (`AddressResolver`, `PathRouter`)
exposed by `policy/routing/router.py`. ADR-0002 defines routing
distance, ordering, and bypass rules, but **the helper API surface
itself** has had no ADR-level coverage.
## First action
### `AddressResolver(graph)`
On construction, caches two pieces of state:
1. `self._node_ids = set(graph.nodes)` — a set of all node ids for
lookup.
2. `self._hbm_slice_bytes = hbm_total_gb * (1 << 30) // slices_per_cube`
— derived from `graph.spec.cube.memory_map` (default `48 GB / 8
slices = 6 GB`). `resolve()` uses this value to decode `pe_id` from
an HBM PA's `hbm_offset`.
In short, **AddressResolver's first act is "precompute the full set of
node ids and the HBM slice size"**. It does not retain the graph
itself.
### `PathRouter(graph)`
On construction, **builds four separate adjacency graphs in one pass**:
1. `self._adj_all`: every edge (used for component-to-component
routing).
2. `self._adj`: edges with `kind != "command"` (PE DMA / generic data
paths).
3. `self._adj_mcpu_dma`: excludes
`_MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"}` (M_CPU DMA
must not pass through PE pipeline nodes).
4. `self._adj_local`: excludes the 8-element `_UCIE_KINDS` set (UCIe
would look like a zero-distance bus to Dijkstra, which would prefer
it over the mesh — for cube-local routing this must be avoided).
Each graph is a `defaultdict(list)` of `(neighbor, weight)`. The
weight is `edge.routing_weight_mm or edge.distance_mm`.
In short, **PathRouter's first act is "classify topology edges into
four policy-specific adjacency lists simultaneously"**. Each `find_*()`
call picks the appropriate graph and runs Dijkstra.
## Context
`policy/routing/router.py` performs two responsibilities together:
- **Naming**: it is the sole owner of the topology naming convention
(`sip{S}.cube{C}.<comp>`, `sip{S}.io{I}.pcie_ep`, etc.). Components /
probe / IPCQ install / runtime API do not build node-id strings
themselves — they call helpers.
- **Path decisions**: policy separation by `edge.kind`. For the same
src→dst, different routing intents (PE DMA vs M_CPU DMA vs general
component routing) call for different adjacencies and so produce
different paths.
This helper API is widely consumed (probe.py / distributed.py /
install.py / various components / tests), yet **the exact signatures /
return semantics / adjacency picks** are not gathered in any ADR. This
ADR closes that gap.
## Decision
### D1. `AddressResolver` exposes five public methods
#### D1.1. `resolve(addr: PhysAddr) -> str`
Translates a `PhysAddr` to a destination node id in the topology:
```
addr.kind == "hbm" → f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
where pe_id = addr.hbm_offset // self._hbm_slice_bytes (ADR-0017 D4/D9)
addr.kind == "pe_resource":
addr.unit_type == PE → f"sip{s}.cube{d}.pe{addr.pe_id}.pe_tcm"
addr.unit_type == SRAM → f"sip{s}.cube{d}.sram"
addr.unit_type == MCPU → f"sip{s}.cube{d}.m_cpu"
others → RoutingError("unsupported unit_type")
other kinds → RoutingError("unsupported address kind")
```
If the derived node id is not in `self._node_ids`, raises
`RoutingError(f"node {node_id} not found in topology")`. So even when
the address has valid syntax, an absent node in the topology
fails-loud.
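The HBM branch can be illustrated with plain integers. This sketch is not the real `AddressResolver`: `hbm_slice_bytes` and `resolve_hbm` are hypothetical free functions, and the 48 GB / 8 slices defaults are the ones quoted in the First action section.

```python
class RoutingError(Exception):
    """Stand-in for the router's exception type."""


def hbm_slice_bytes(hbm_total_gb=48, slices_per_cube=8):
    """Size of one HBM slice in bytes (6 GiB with the default values)."""
    return hbm_total_gb * (1 << 30) // slices_per_cube


def resolve_hbm(sip, cube, hbm_offset, slice_bytes, node_ids):
    """Decode pe_id from the offset and return the hbm_ctrl node id."""
    pe_id = hbm_offset // slice_bytes
    node_id = f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe_id}"
    if node_id not in node_ids:
        # Valid address syntax is not enough; the node must exist.
        raise RoutingError(f"node {node_id} not found in topology")
    return node_id
```

An offset one byte below the slice size still decodes to `pe0`, and the first byte of the next slice decodes to `pe1`.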
#### D1.2. `find_m_cpu(sip, cube) -> str`
Returns `f"sip{sip}.cube{cube}.m_cpu"`; absent → `RoutingError`.
#### D1.3. `find_pcie_ep(sip, io_id="io0") -> str`
Returns `f"sip{sip}.{io_id}.pcie_ep"`; absent → `RoutingError`.
#### D1.4. `find_io_cpu(sip, io_id="io0") -> str`
Returns `f"sip{sip}.{io_id}.io_cpu"`; absent → `RoutingError`.
#### D1.5. `find_all_pcie_eps() -> list[str]`
All PCIE_EP node ids across all SIPs, sorted. Filtered by
`endswith(".pcie_ep")`. Cross-SIP IPCQ uses this when enumerating
PCIE_EPs.
This class is the sole owner of the naming convention
(`sip{S}.cube{C}.<comp>`, `sip{S}.{io_id}.<comp>`) — ADR-0015 D4.
The topology builder produces nodes with the same naming convention;
components never build node-id strings directly — they go through
these helpers.
### D2. `PathRouter`'s four adjacency graphs
Constructed in one pass. `edge.kind` drives policy:
| graph | excluded edge kinds | use case |
|-------------------|--------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|
| `_adj_all` | (none) | M_CPU↔NOC command included, IO_CPU/M_CPU routes |
| `_adj` | `"command"` | PE DMA / generic data paths |
| `_adj_mcpu_dma` | `"pe_internal"`, `"pe_to_router"` | M_CPU DMA (skips PE pipeline) |
| `_adj_local` | `_UCIE_KINDS` (`ucie_internal`, `ucie_conn_to_router`, `router_to_ucie_conn`, `ucie_conn_to_noc`, `noc_to_ucie_conn`, `ucie_mesh`, `io_to_cube`, `cube_to_io`) | same-cube routing (UCIe bus excluded) |
Each graph is `dict[node_id, list[(neighbor, weight)]]` with weight =
`edge.routing_weight_mm or edge.distance_mm`. Excluding command edges
prevents them from influencing routing; isolating `_adj_local` keeps
UCIe's "zero-distance bus" from out-competing the mesh — consistent
with ADR-0017 D7's cross-PE-slice mesh-distance requirement.
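The one-pass classification might look roughly like this. It is a sketch under assumptions: `EdgeSketch` and `build_adjacency` are hypothetical, edges are treated as directed, and the exclusion sets are copied from the table above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"}
UCIE_KINDS = {
    "ucie_internal", "ucie_conn_to_router", "router_to_ucie_conn",
    "ucie_conn_to_noc", "noc_to_ucie_conn", "ucie_mesh",
    "io_to_cube", "cube_to_io",
}


@dataclass
class EdgeSketch:
    src: str
    dst: str
    kind: str
    distance_mm: float
    routing_weight_mm: Optional[float] = None


def build_adjacency(edges):
    """Classify every edge into the four policy graphs in a single pass."""
    adj_all, adj, adj_mcpu_dma, adj_local = (defaultdict(list) for _ in range(4))
    for e in edges:
        weight = e.routing_weight_mm or e.distance_mm
        adj_all[e.src].append((e.dst, weight))
        if e.kind != "command":
            adj[e.src].append((e.dst, weight))
        if e.kind not in MCPU_DMA_EXCLUDE:
            adj_mcpu_dma[e.src].append((e.dst, weight))
        if e.kind not in UCIE_KINDS:
            adj_local[e.src].append((e.dst, weight))
    return adj_all, adj, adj_mcpu_dma, adj_local
```

Each `find_*()` method then only has to pick one of the four returned dicts, which is the O(1) selection that alternative A1 contrasts with per-call filtering.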
### D3. `PathRouter` exposes six public methods (+ two backward-compat shims)
#### D3.1. `find_path(src_pe: str, dst_node: str) -> list[str]`
**PE DMA routing**. `src_pe` is a PE prefix (e.g.,
`"sip0.cube0.pe0"`); the function auto-prepends `.pe_dma`, making the
true start node `"sip0.cube0.pe0.pe_dma"`.
Adjacency depends on cube-locality (`_same_cube`):
- **Same-cube** (src and dst share `sip{S}.cube{C}.` prefix): uses
`_adj_local`. Excluding UCIe lets cross-PE-slice access pay accurate
mesh distance (ADR-0017 D7).
- **Cross-cube**: uses `_adj`. UCIe naturally becomes the right choice
for the cross-cube portion.
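The cube-locality decision can be sketched with string prefixes alone. `cube_prefix` and `pick_adjacency` are hypothetical names standing in for `_same_cube` and the selection inside `find_path`; the real helper may parse node ids differently.

```python
def cube_prefix(node_id):
    """Return 'sip{S}.cube{C}.' for cube-scoped node ids, else None."""
    parts = node_id.split(".")
    if len(parts) >= 2 and parts[0].startswith("sip") and parts[1].startswith("cube"):
        return f"{parts[0]}.{parts[1]}."
    return None


def pick_adjacency(src_pe, dst_node, adj_local, adj):
    """Return (start_node, graph) following the D3.1 policy."""
    start = f"{src_pe}.pe_dma"  # the PE prefix is extended to its DMA node
    src_cube = cube_prefix(start)
    same_cube = src_cube is not None and src_cube == cube_prefix(dst_node)
    return start, (adj_local if same_cube else adj)
```

A destination outside any cube, such as an IO node, has no cube prefix and therefore falls through to the cross-cube graph in this sketch.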
#### D3.2. `find_path_with_distance(src_pe, dst_node) -> tuple[list[str], float]`
Same adjacency policy as D3.1, but returns `(path, total_distance)`.
Used by probe and analysis tools that need the distance metric.
#### D3.3. `find_mcpu_dma_path(m_cpu_id: str, dst_hbm_id: str) -> list[str]`
**M_CPU DMA path**. Same cube → `_adj_local` (stay within the mesh);
different cube → `_adj_all` (cross via UCIe). The
`_MCPU_DMA_EXCLUDE` set ensures PE-pipeline nodes never appear on
M_CPU's routes.
#### D3.4. `find_memory_path(src: str, dst: str) -> list[str]`
Direct memory path like
`pcie_ep → io_noc → cube → router mesh → hbm_ctrl`. Uses
`_adj_mcpu_dma` to exclude `pe_internal` and `pe_to_router`, so
host-issued reads/writes never leak into the PE pipeline. Probe
(ADR-0049 D1's H2D/D2H cases) calls this directly.
#### D3.5. `find_node_path(src: str, dst: str) -> list[str]`
Generic routing between arbitrary nodes, **including command edges**
(via `_adj_all`). IoCpuComponent / MCpuComponent use this when they
need to route through M_CPU ↔ NOC command-kind links.
#### D3.6. Backward-compat shims
- `_dijkstra(start, goal) -> list[str]`: thin wrapper for
`_run_dijkstra(self._adj, …)`.
- `_dijkstra_with_dist(start, goal) -> tuple[list[str], float]`:
distance-aware variant.
Despite the underscore prefixes (suggesting internal API), existing
tests call these directly. New code should prefer D3.1–D3.5; these two
shims are deprecation candidates.
### D4. Dijkstra — single-source shortest path
`_run_dijkstra_with_dist(adj, start, goal)`:
- `heapq` priority queue.
- `best: dict[node, distance]` — best known distance to each node.
- `prev: dict[node, predecessor]` — for path reconstruction.
- Edge weight = `routing_weight_mm or distance_mm`. The separation
matters because UCIe (and a few others) declare an explicit
`routing_weight_mm` distinct from physical `distance_mm`.
`start == goal` short-circuits to `([start], 0.0)`. An unreachable target
raises `RoutingError(f"no path from {start} to {goal}")`.
The algorithm is **deterministic**: identical graph + start/goal gives
the same path, satisfying SPEC R1 ("routing MUST be deterministic").
Tie-breaks follow `heapq`'s push order (Python list order is
deterministic).
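A minimal Dijkstra in the shape described above is sketched below. It is an illustration rather than the code in `router.py`: the explicit push counter is this sketch's way of making ties resolve in push order, and the function name drops the leading underscore to mark it as a stand-in.

```python
import heapq
import itertools


class RoutingError(Exception):
    """Stand-in for the router's exception type."""


def run_dijkstra_with_dist(adj, start, goal):
    """Return (path, total_distance) over adj: node -> [(neighbor, weight)]."""
    if start == goal:
        return [start], 0.0
    best = {start: 0.0}  # best known distance to each node
    prev = {}            # predecessor map for path reconstruction
    seq = itertools.count()  # push counter: equal distances pop in push order
    heap = [(0.0, next(seq), start)]
    while heap:
        dist, _, node = heapq.heappop(heap)
        if node == goal:
            break
        if dist > best[node]:
            continue  # stale heap entry
        for neighbor, weight in adj.get(node, ()):
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, next(seq), neighbor))
    if goal not in best:
        raise RoutingError(f"no path from {start} to {goal}")
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    path.reverse()
    return path, best[goal]
```

Because the strict `<` comparison never replaces an equally good predecessor, the first-discovered route among equal-cost alternatives is the one returned, which keeps results stable across runs.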
### D5. Single-owner principle for helper-API decisions
The following decisions live only inside router.py:
- Naming convention: `sip{S}.cube{C}.<comp>`,
`sip{S}.{io_id}.<comp>`,
`sip{S}.cube{C}.hbm_ctrl.pe{pe_id}`.
- Adjacency policy: which edge kinds belong to which graph.
- Algorithm for recovering PE id from an HBM slice size.
- Dijkstra weight selection
(`routing_weight_mm or distance_mm`).
Breaking single ownership (e.g., a component starting to build
`f"sip{s}..."` itself) would explode the blast radius of naming-
convention changes. This aligns with ADR-0015 D4.
### D6. Consumers of the helper API
Methods listed in this ADR are called from (current corpus):
- `probes/probe.py` (ADR-0049): `find_pcie_ep`, `find_io_cpu`,
`find_m_cpu`, `find_node_path`, `find_mcpu_dma_path`,
`find_memory_path`, `find_path`, `resolve`.
- `runtime_api/distributed.py` (ADR-0047): indirectly (engine-internal
routing).
- `ccl/install.py` (ADR-0023): `find_all_pcie_eps`, `resolve`.
- `sim_engine/event_log.py`: like probe — `find_pcie_ep`,
`find_memory_path`.
- `components/builtin/m_cpu.py`, `components/builtin/io_cpu.py`:
`find_node_path`, `find_mcpu_dma_path`.
- Tests (test_routing.py, test_cross_sip_routing.py, …): most of
D3.1–D3.5.
When a new consumer arrives, D1/D3 act as a first-pass guide on
whether an existing method matches the intent or a new one is needed.
## Alternatives Considered
### A1. One adjacency graph + per-call edge-kind filtering
Rejected. Re-filtering the graph on every `find_*()` call hurts
Dijkstra cache locality. Constructing four graphs in one pass (D2)
has modest memory cost (edges ≤ a few × 10⁴), and selection happens
in O(1) at call time.
### A2. Drive adjacency separation by separate edge metadata rather than `kind`
Rejected. `edge.kind` is already assigned by the topology builder
(ADR-0015 D4 + ADR-0017); a parallel metadata field would force
synchronization between two systems.
### A3. Use BFS with uniform weights instead of Dijkstra
Rejected. With per-edge `routing_weight_mm` (mesh link / UCIe /
IO-internal), BFS minimizes hop count rather than total
latency/distance. SPEC R1 + R2 require deterministic and accurate
routing, which BFS does not deliver.
### A4. Express the helper API as module functions instead of classes
Rejected. Each class
(`AddressResolver`, `PathRouter`) maintains caches
(`_node_ids`, `_hbm_slice_bytes`, four adjacency graphs) reused across
many routing queries on the same graph. Module functions would have
to rebuild state per call or go global, hurting safety and
performance.
## Consequences
- When components / probe / IPCQ install / runtime API all go through
router.py helpers, a naming-convention change (e.g., `.io0.` →
`.iochiplet0.`) is a one-file edit (D5).
- D2's four-graph split is now ADR-locked, so when a new edge kind is
added (e.g., a new inter-die UCIe-link kind), the right adjacency
category is decided explicitly rather than by default.
- D3.1's same-cube vs cross-cube branching (ADR-0017 D7) is explicit,
so anyone changing routing knows which adjacency to touch.
- D6's consumer list bounds PR-review scope for helper-API changes,
and the backward-compat shims (D3.6) are flagged as deprecation
candidates.
@@ -0,0 +1,371 @@
# ADR-0052: OpLog + MemoryStore Schemas — sim_engine internals
## Status
Accepted (2026-05-22).
Pins down the `OpRecord` schema and the `record_start` / `record_end` /
`record_copy` behavior in `sim_engine/op_log.py`, plus the
(space, addr) namespace and read/write semantics of `MemoryStore` in
`sim_engine/memory_store.py`. ADR-0020 (2-pass data execution) declares
that these two facilities exist, but **the precise record fields and
semantics** had no ADR-level coverage, and several recent ADRs
(ADR-0046 D3.2's `tl.store` visibility, ADR-0023 D9's IPCQ copy
record) depend on these semantics.
## First action
### `OpLogger(memory_store=None)`
On construction, initialize three fields:
1. `self._records: list[OpRecord] = []` — accumulated records.
2. `self._pending: dict[int, dict] = {}` — partial records keyed by
`id(msg)` (created at `record_start`, completed at `record_end`).
3. `self._memory_store = memory_store` — optional MemoryStore
reference. Used to capture math-op input snapshots and dma_write
HBM-source snapshots.
Records and pending are empty; the `record_*` calls accumulate data
over time.
### `MemoryStore()`
On construction, initialize a single field:
`self._storage: dict[str, dict[int, np.ndarray]] = {}` — a two-level
dict (`space → addr → ndarray`). Inner dicts are created lazily as new
spaces appear.
In short, **both facilities' first act is "set up an empty accumulator
buffer plus a sparse, per-space dict"**. The first record / write
fills the fields when it arrives.
## Context
ADR-0020 D2/D5/D7 (2-pass data execution) declares:
- During Phase 1 (timing), `ComponentBase._on_process_start/end` hooks
call `OpLogger.record_start/end`, recording the time and metadata of
every data op.
- Phase 2 (data) replays the op log in `t_start` order to compute real
data.
- Data payloads live in `MemoryStore`, keyed by (space, addr).
Subsequent ADRs (ADR-0023 D9's IPCQ atomic write, ADR-0027's Megatron
TP scratch-overwrite avoidance, ADR-0046 D3.2's `tl.store` visibility)
depend on op_log and MemoryStore behavior, but **the exact record
fields / space names / snapshot timing** are only discoverable via
source grep. This ADR codifies them.
## Decision
### D1. `OpRecord` schema — seven fields
```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component_id: str
    op_kind: str  # "memory" | "gemm" | "math" | "unknown"
    op_name: str  # e.g. "dma_read", "gemm_f16", "exp",
                  # "TileToken/DMA_READ", "composite_gemm",
                  # "ipcq_copy"
    params: dict[str, Any]
    dependency_ids: list[int] = field(default_factory=list)
```
- **`t_start` / `t_end`**: SimPy time (float ns). `t_start` is when the
component begins the op; `t_end` is completion. Duration =
`t_end - t_start`.
- **`component_id`**: the node id where the op occurred (e.g.,
`"sip0.cube0.pe0.pe_dma"`).
- **`op_kind`**: one of four. Phase 2 DataExecutor branches on this.
- **`op_name`**: a debug/analysis-friendly name. For a TileToken,
expands to `"TileToken/{stage_type}"` (e.g.,
`"TileToken/DMA_READ"`) to disambiguate stages.
- **`params`**: op-specific metadata dict (see D3).
- **`dependency_ids`**: currently unused (default `[]`). Reserved for
future cross-op dependency tracking.
### D2. `OpLogger.records` — guaranteed `t_start` sort
```python
@property
def records(self) -> list[OpRecord]:
    self._records.sort(key=lambda r: r.t_start)
    return self._records
```
A stable sort by `t_start` runs on each access. Records with the same
`t_start` preserve insertion order. Aligns with ADR-0020 D5's
"t_start stable ordering" requirement.
Phase 2 DataExecutor always accesses via the `records` property, so
even when `record_end` calls arrive out of `t_start` order (e.g., a
short op started later but finished earlier), the sequence handed to
Phase 2 is consistent.
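A small sketch of why the stable sort matters. `Rec` is a trimmed, hypothetical stand-in for `OpRecord`, and the op names are invented for the example; only the sort call mirrors the property above.

```python
from dataclasses import dataclass


@dataclass
class Rec:  # trimmed stand-in for OpRecord (illustrative only)
    t_start: float
    t_end: float
    op_name: str


# Appended in record_end order: the long op started first but finished last.
records = [
    Rec(t_start=5.0, t_end=6.0, op_name="short_late"),
    Rec(t_start=0.0, t_end=9.0, op_name="long_early"),
    Rec(t_start=5.0, t_end=7.0, op_name="same_start_second"),
]

# list.sort is stable, so equal t_start values keep their insertion order.
records.sort(key=lambda r: r.t_start)
order = [r.op_name for r in records]
```

After the sort, `long_early` comes first, and the two records sharing `t_start == 5.0` stay in the order they were appended.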
### D3. `params` schema per `op_name` (matrix from `_extract_op_info`)
#### D3.1. `op_kind="memory", op_name="dma_read"` (DmaReadCmd)
```python
{"src_addr": int, "nbytes": int, "handle_id": str}
```
#### D3.2. `op_kind="memory", op_name="dma_write"` (DmaWriteCmd)
```python
{
    "src_space": str,      # handle.space ("tcm"|"hbm"|"sram"), default "tcm"
    "src_addr": int,       # handle.addr
    "shape": tuple, "dtype": str,
    "dst_space": "hbm",    # DmaWrite always targets HBM
    "dst_addr": int,
    "nbytes": int,
    "handle_id": str,
    # When src_space == "hbm" at record_end, a snapshot is added (D4)
    "snapshot": np.ndarray | None,
}
```
#### D3.3. `op_kind="gemm", op_name=f"gemm_{dtype_a}"` (GemmCmd)
```python
{
    "src_a_addr": int, "src_b_addr": int, "dst_addr": int,
    "shape_a": tuple, "shape_b": tuple, "shape_out": tuple,
    "dtype_in": str, "dtype_out": str,
    "m": int, "k": int, "n": int,
    # ADR-0027: per-operand + output spaces preserved
    "src_a_space": str, "src_b_space": str, "dst_space": str,
}
```
#### D3.4. `op_kind="math", op_name=msg.op` (MathCmd; op = "exp", "sum", "add", "where", …)
```python
{
    "input_addrs": list[int],  # addrs of input handles
    "input_shapes": list[tuple],
    "input_spaces": list[str],
    "input_dtypes": list[str],
    "dst_addr": int, "dst_space": str,
    "shape_out": tuple, "dtype": str,
    "axis": int | None,        # only meaningful for reductions
    # All inputs get snapshots at record_end (D4)
    "input_snapshots": list[np.ndarray | None],
}
```
#### D3.5. `op_kind="gemm" or "math", op_name=f"composite_{op}"` (CompositeCmd)
```python
{
    "op": str,  # "gemm" | "math"
    "out_addr": int, "out_nbytes": int,
    # If op == "gemm", same fields as GemmCmd are added:
    "src_a_addr": int, "src_b_addr": int,
    "shape_a": tuple, "shape_b": tuple,
    "dtype_in": str, "dtype_out": str,
    "src_a_space": str, "src_b_space": str,
    "dst_space": "hbm", "dst_addr": int,  # = out_addr
}
```
If `op == "gemm"`, `op_kind = "gemm"`; otherwise `"math"`. This aliasing
lets Phase 2 replay composite-gemm on the same path as `GemmCmd`.
#### D3.6. `op_kind="memory", op_name="ipcq_copy"` (record_copy path)
```python
{
    "src_space": str, "src_addr": int,
    "dst_space": str, "dst_addr": int,
    "shape": tuple, "dtype": str, "nbytes": int,
    "snapshot": np.ndarray | None,  # passed by caller; if None, record_copy reads fresh
}
```
`PE_DMA._handle_ipcq_inbound` (ADR-0023 D9) emits this record so Phase
2 can replay the IPCQ slot's inbound copy. It bypasses
`record_start` / `record_end` and pushes directly via `record_copy()`.
#### D3.7. `op_kind="unknown", op_name=type(msg).__name__`
Fallback for messages `_extract_op_info` doesn't recognize. `params =
{}`. If DataExecutor encounters this kind, it skips — Phase 2 replay
is unaffected.
### D4. Snapshot capture timing
When `OpLogger._memory_store` is set, `record_end` performs:
- **Math op**: read every input
(addr/shape/space/dtype) from `self._memory_store.read(...)` and
attach an ndarray copy to `params["input_snapshots"]`. Read failure →
`None`.
- **`dma_write` op**: snapshot the source **only if `src_space ==
"hbm"`** and attach to `params["snapshot"]`. TCM (PE scratch)
sources are **deliberately skipped** — TCM is repopulated by Phase 2
math/gemm replay, and a Phase-1-time snapshot would capture a
previous kernel's stale value (ADR-0027 postmortem: TP gemm →
all_reduce race).
- **`ipcq_copy`**: the caller passes the in-flight snapshot via
`snapshot=token.data`. If absent, `record_copy` attempts a fresh
read from MemoryStore.
Snapshots are taken with `.copy()` (fresh allocation), making them
safe against later storage mutation. This is the foundation of
ADR-0027's "cross-PE Phase 2 ordering" race-avoidance.
When `memory_store` is `None` (Phase 1 timing-only mode), all
snapshot steps are skipped. Only the timing portion of the record is
preserved; data replay is unavailable.
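The per-op policy above can be sketched as a single function. This is an illustration under assumptions, not the `record_end` implementation: a plain dict-of-dicts stands in for `MemoryStore`, Python lists stand in for ndarrays, and `attach_snapshots` is a hypothetical name. The `ipcq_copy` path is omitted because the caller supplies its snapshot.

```python
def attach_snapshots(op_kind, op_name, params, store):
    """Mutate params following the D4 policy; store may be None."""
    if store is None:
        # Timing-only mode: every snapshot step is skipped.
        return params

    def read_copy(space, addr):
        data = store.get(space, {}).get(addr)
        # A failed read yields None; a successful one is detached via copy().
        return None if data is None else data.copy()

    if op_kind == "math":
        # Math ops snapshot every input.
        params["input_snapshots"] = [
            read_copy(space, addr)
            for space, addr in zip(params["input_spaces"], params["input_addrs"])
        ]
    elif op_name == "dma_write" and params.get("src_space") == "hbm":
        # dma_write snapshots HBM sources only; TCM sources are skipped.
        params["snapshot"] = read_copy("hbm", params["src_addr"])
    return params
```

The copy is what makes the snapshot safe: mutating the stored payload afterwards leaves the recorded snapshot unchanged.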
### D5. TileToken handling — `record_start` captures stage info
ADR-0014 D6's self-routing tile token (pipeline mode) may have already
advanced its `stage_idx` by the time `record_end` runs (the TileToken
caches the next stage's params as it moves to the next component).
Therefore:
`record_start` pre-saves the following in `pending[id(msg)]["snap"]`:
```python
snap["stage_type"] = stage.stage_type.name # "DMA_READ", "GEMM", ...
snap["stage_params"] = dict(stage.params) # copy of params at start time
```
`record_end` retrieves this snap and merges into params:
- Adds `params["stage_type"]` to final params.
- Merges `stage_params` keys (keeps existing values if any).
- If `op_name == "TileToken"`, rewrites it to
`f"TileToken/{stage_type}"` (e.g., `"TileToken/DMA_READ"`),
disambiguating different stages emitted by the same component.
Thanks to this, DMA_READ vs DMA_WRITE, FETCH vs STORE coming from the
same component (e.g., pe_dma) are distinguishable in reports.
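A minimal sketch of the start/end hand-off, assuming a token object with `stages` and `stage_idx` attributes. The function signatures and the returned dict are hypothetical; the source stores `stage.stage_type.name` from an enum, whereas this example uses a plain string to stay self-contained.

```python
from types import SimpleNamespace

pending = {}  # keyed by id(msg), as in OpLogger._pending


def record_start(msg, t_now):
    stage = msg.stages[msg.stage_idx]
    pending[id(msg)] = {
        "t_start": t_now,
        "snap": {
            "stage_type": stage.stage_type,      # source: stage.stage_type.name
            "stage_params": dict(stage.params),  # copy taken at start time
        },
    }


def record_end(msg, t_now, op_name, params):
    entry = pending.pop(id(msg))
    snap = entry["snap"]
    params["stage_type"] = snap["stage_type"]
    for key, value in snap["stage_params"].items():
        params.setdefault(key, value)  # keep existing values if any
    if op_name == "TileToken":
        op_name = f"TileToken/{snap['stage_type']}"
    return {"t_start": entry["t_start"], "t_end": t_now,
            "op_name": op_name, "params": params}


# The token advances to its next stage before record_end runs.
token = SimpleNamespace(stage_idx=0, stages=[
    SimpleNamespace(stage_type="DMA_READ", params={"nbytes": 64}),
    SimpleNamespace(stage_type="GEMM", params={"m": 8}),
])
record_start(token, 1.0)
token.stage_idx = 1
rec = record_end(token, 2.0, "TileToken", {"nbytes": 128})
```

Even though `stage_idx` has moved on by the time `record_end` runs, the record is still labelled with the stage that was active at start.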
### D6. `MemoryStore` — two-level (space, addr) dict
```python
import numpy as np

class MemoryStore:
    def __init__(self) -> None:
        self._storage: dict[str, dict[int, np.ndarray]] = {}
    def write(self, space, addr, data):
        # Create the inner dict lazily on first write to a new space (D6.1).
        self._storage.setdefault(space, {})[addr] = data
    def read(self, space, addr, shape=None, dtype=None) -> np.ndarray: ...
    def has(self, space, addr) -> bool: ...
    def snapshot(self) -> "MemoryStore": ...
```
#### D6.1. Space namespace
A string key. Standard values:
- `"hbm"`: HBM data (deploy_tensor + Phase 2 dma_write results).
- `"tcm"`: PE-local TCM (Phase 2 math/gemm output).
- `"sram"`: cube-level SRAM (ADR-0023 D9.7's IPCQ slot tier).
Other spaces (e.g., `"reg"`) are allowed — `_storage` is a lazy dict
that creates a new space when `write` first touches it.
#### D6.2. Address keying
`addr` is an integer. It may be a **physical address (PA) or a virtual
address (VA)** — `MemoryStore` itself doesn't know address-space
semantics; it just uses them as keys. Phase 1's `MemoryWriteMsg`
writes both PA and VA
(`_create_tensor` zero-inits at PA and at the VA base too); Phase 2
reads/writes via the addresses captured by op_log.
The caller decides `addr`'s meaning — `MemoryStore` provides only
lookup.
#### D6.3. read/write semantics — reference store (no copy)
`write(space, addr, data)`: stores the ndarray reference. **No copy.**
If the caller later mutates the same ndarray, the stored value
changes.
`read(space, addr, shape=None, dtype=None)`: returns the stored
ndarray reference. If `shape`/`dtype` are provided:
- `dtype != stored.dtype`: `arr.view(np_dtype)` reinterprets as a
view (no copy).
- `shape != stored.shape`: if `nbytes` matches, `arr.reshape(shape)`
is a view.
- `nbytes` mismatch → `ValueError`.
To detach the data, the caller must call `arr.copy()`. ADR-0027's
race-avoidance requires explicit `.copy()` in op_log snapshot steps
for exactly this reason.
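The aliasing behavior can be seen directly with NumPy. The sketch below uses a bare dict in place of `MemoryStore` (an assumption for brevity) and shows that a stored array tracks later mutation, that `view` and `reshape` share memory with it, and that only `.copy()` detaches.

```python
import numpy as np

storage = {}  # space -> addr -> ndarray, as in D6

buf = np.zeros(4, dtype=np.float32)
storage.setdefault("tcm", {})[0x100] = buf  # write: stores the reference

buf[0] = 1.0                                # caller mutates after the write
stored = storage["tcm"][0x100]
aliased = bool(stored[0] == 1.0)            # the store sees the mutation

as_u8 = stored.view(np.uint8)               # dtype reinterpretation, no copy
as_2x2 = stored.reshape((2, 2))             # same nbytes, so reshape is a view
shares = np.shares_memory(as_u8, stored) and np.shares_memory(as_2x2, stored)

detached = stored.copy()                    # explicit copy detaches the data
buf[1] = 2.0                                # visible in stored, not in detached
```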
#### D6.4. `has(space, addr) -> bool`
Existence check; does not materialize data.
#### D6.5. `snapshot() -> MemoryStore`
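Shallow copy. Creates a new instance with new inner dicts but shares
ndarray references. Used at Phase 2 init to fork from Phase 1's
store, so Phase 2 `write` calls (which rebind keys) don't affect
Phase 1's remaining consumers. In-place mutation of a shared ndarray
is still visible through both stores (see D6.3).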
### D7. op_log assumes a single-threaded SimPy
`OpLogger`'s `_records` and `_pending` are lock-free. SimPy is
single-threaded, so nothing else can intrude between `record_start`
and `record_end` for the same message.
When multi-process kernbench (ADR-0047 D6) arrives, OpLogger must be
split per process — one OpLogger instance cannot receive records from
multiple processes.
## Alternatives Considered
### A1. Externalize op_log to SQLite / parquet
Rejected (currently). The in-memory list minimizes Phase 1 → Phase 2
hand-off latency. Externalization makes sense for long-running batch
runs but adds overhead for the current single-run workload.
### A2. Capture snapshots at `record_start`
Rejected. At `record_start`, inputs are often not yet populated (e.g.,
a math op's input is the output of a just-issued previous op).
`record_end` is the correct point.
### A3. Per-component MemoryStore
Rejected. The (space, addr) key already disambiguates effectively, and
splitting per component would complicate cross-PE IPCQ copy (ADR-0023
D9), which needs access to both source and destination stores.
### A4. Explicit dependency edges in op_log
Partially adopted. The `dependency_ids` field exists on `OpRecord` but
is currently unused (D1). Phase 2 DataExecutor orders via `t_start` +
a secondary sort (memory ops before math at the same `t_start`). When
an explicit dependency graph is required, this field is the home.
Current ordering rules are sufficient, so it remains unused.
## Consequences
- ADR-0020's op_log / MemoryStore declarations are expanded into the
concrete D1–D6 schemas, so writing/modifying Phase 2 DataExecutor
doesn't need source-grep to learn field semantics.
- D3's per-`op_name` params matrix makes adding new ops (e.g., a new
reduction type) a question of branching in `_extract_op_info`.
- D4's per-op snapshot policy (math = input snapshot, dma_write =
HBM-only snapshot) is ADR-locked, so ADR-0027's race-avoidance
decision won't silently regress on future refactors.
- D6.3's reference-store semantics are explicit, putting mutation
safety on the caller. ADR-0027's explicit `.copy()` pattern is
justified.
- D7's single-thread assumption is recorded, so multi-process
kernbench (ADR-0047 D6's supersession candidate) will need OpLogger
separation when introduced.
@@ -0,0 +1,351 @@
# ADR-0053: Topology Builder + Visualizer Algorithms
## Status
Accepted (2026-05-22).
Pins down the key algorithmic choices of the topology compile and
visualization pipeline jointly implemented by `topology/builder.py`,
`topology/mesh_gen.py`, and `topology/visualizer.py`:
placement-driven router attachment, mesh auto-layout, the source_hash
cache, view projections, and SVG rendering. ADR-0006 defines the
high-level intent of topology compilation (compiled topology, distance
extraction, automatic diagram generation), but **which algorithms the
builder actually uses** was only discoverable via source grep.
## First action
When `resolve_topology(path_str)` is called, four steps run in order:
1. **Path validation** (`builder.py::resolve_topology`):
`Path(path_str).expanduser().resolve()`, existence check, file
check. Failure → `FileNotFoundError` or `ValueError`.
2. **YAML parsing** (`_read_spec`): `yaml.safe_load`. Parse errors
yield a `ValueError` with line/column. Non-dict roots are
rejected.
3. **Auto-generate the mesh** (`mesh_gen.ensure_mesh_file`): create or
reuse a `cube_mesh.yaml` next to the topology file. Cache hit on
matching source_hash; miss triggers regeneration. This step decides
the cube NoC's router grid and attachment information.
4. **Compile the graph** (`_compile_graph`): system → IO chiplets →
cubes → inter-cube edges → IO↔cube edges → system↔IO edges, then
build four view projections (system, sip, cube, pe) and wrap into
a `TopologyGraph`.
In short, **topology compilation's first act is "read topology.yaml as
a dict, create/validate cube_mesh.yaml in the same directory, then
build the flat graph + 4-view projection in system → sip → cube → pe
order"**.
## Context
`topology/` package responsibilities:
- **builder.py** (1207 lines): turns topology.yaml into a
`TopologyGraph` (nodes + edges + 4 view projections).
- **mesh_gen.py** (305 lines): auto-decides the cube NoC's router
grid and PE/UCIe/M_CPU/SRAM attachment positions and caches them in
`cube_mesh.yaml`.
- **visualizer.py** (887 lines): generates four SVG diagrams (system /
sip / cube / pe) from a `TopologyGraph`.
ADR-0006 makes the high-level decision that "the result of topology
compilation is the single source for distance metadata and diagram
generation", but specific algorithms (e.g., placement-driven nearest-
router attachment, the HBM exclusion zone, which fields in source_hash
trigger regeneration) are not in any ADR.
In particular, these decisions are absent at ADR level:
- Why is mesh_gen cached in a separate file (`cube_mesh.yaml`)?
- Which fields are in source_hash, and which changes force
regeneration?
- Why placement coordinates in mm rather than cube coordinates?
- How are the HBM exclusion zone and UCIe N/S/E/W distribution
decided inside the mesh?
- What is the abstraction-level difference among the four view
projections (system/sip/cube/pe)?
This ADR captures these decisions in one place.
## Decision
### D1. Compile pipeline — six stages
`_compile_graph(spec)`:
1. **System nodes** (`_instantiate_system`): add system-level nodes
like `fabric.switch0` and the host CPU.
2. **Per-SIP loop** (`for sip_id in range(system.sips.count)`):
- **IO chiplets** (`_instantiate_io_chiplets`): create pcie_ep /
io_cpu / io_noc / io_ucie PHYs / conn nodes and their bidirectional
internal edges.
- **Cube instantiation** (`_instantiate_cube`): using
cube_mesh.yaml's router grid, instantiate cube routers, PE
sub-components (pe_cpu, pe_dma, pe_fetch_store, pe_gemm, pe_math,
pe_mmu, pe_tcm, pe_scheduler, pe_ipcq), m_cpu, sram, hbm_ctrl,
and their internal edges.
- **Inter-cube edges** (`_add_inter_cube_edges`): the UCIe
N/S/E/W mesh edges.
- **IO ↔ cube edges** (`_add_io_to_cube_edges`): connect io_noc to
each cube's edge UCIe phy.
3. **Switch ↔ IO edges** (`_add_system_to_io_edges`): bidirectional
edges between `fabric.switch0` and each SIP's `pcie_ep` (the
cross-SIP IPCQ path of ADR-0038 D3 + ADR-0010).
4. **Build four view projections**:
- `_build_system_view(spec)` — Tray level: SIPs and the system
switch.
- `_build_sip_view(spec)` — inside one SIP: cube mesh + IO
chiplet.
- `_build_cube_view(spec)` — inside one cube: router grid + PE /
M_CPU / SRAM / HBM_CTRL attachments.
- `_build_pe_view(spec)` — inside one PE: nine sub-components +
internal edges (pe_internal kind).
5. **Return `TopologyGraph`**: `TopologyGraph(spec, nodes, edges,
system_view, sip_view, cube_view, pe_view)`.
The six graph-construction stages (system nodes, IO chiplets, cubes,
inter-cube edges, IO ↔ cube edges, switch ↔ IO edges) are **ordered
for a reason**: only after cubes exist do inter-cube edges have valid
src/dst, and IO chiplets must precede the IO ↔ cube edges that
reference them. New node types must slot into the right spot.
### D2. `cube_mesh.yaml` — a separate file with a source_hash cache
`mesh_gen.ensure_mesh_file(cube_spec, mesh_path)`:
1. Compute `source_hash = _compute_source_hash(cube_spec)` from these
input fields:
- `geometry` (cube_mm.w/h …).
- `pe_layout` (corners, pe_per_corner).
- `ucie.n_connections`.
- `memory_map.hbm_mapping_mode`.
- `placement` (m_cpu/sram pos_mm).
2. If `mesh_path` (= `cube_mesh.yaml` next to topology.yaml) exists
and `existing.source_hash == source_hash`, reuse it (cache hit).
3. Otherwise, generate a new mesh via
`_generate_mesh(cube_spec, source_hash)` and write to yaml.
The mesh is cached as a separate file because:
- Mesh generation involves nontrivial PE/UCIe/router attachment math
and is too expensive to redo every time.
- Multiple runs with the same cube spec must guarantee an identical
mesh.
- The resulting mesh is itself an inspectable / debuggable artifact.
The five fields listed in source_hash are the ones that determine
mesh shape; other changes (e.g., bandwidth, overhead_ns) do not
trigger mesh regeneration.
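The cache check can be sketched as below. The choice of SHA-256 over a sorted JSON dump is an assumption for illustration (the ADR does not say how `_compute_source_hash` serializes its inputs), and `compute_source_hash` / `needs_regeneration` are hypothetical names; only the five hashed fields come from the list above.

```python
import hashlib
import json

# The five inputs listed in D2, expressed as key paths into the cube spec.
HASHED_FIELDS = (
    ("geometry",),
    ("pe_layout",),
    ("ucie", "n_connections"),
    ("memory_map", "hbm_mapping_mode"),
    ("placement",),
)


def _dig(spec, path):
    """Follow a key path, returning None if any level is missing."""
    for key in path:
        if not isinstance(spec, dict) or key not in spec:
            return None
        spec = spec[key]
    return spec


def compute_source_hash(cube_spec):
    picked = {".".join(path): _dig(cube_spec, path) for path in HASHED_FIELDS}
    blob = json.dumps(picked, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def needs_regeneration(cube_spec, existing_mesh):
    """existing_mesh is the parsed cube_mesh.yaml dict, or None if absent."""
    if existing_mesh is None:
        return True
    return existing_mesh.get("source_hash") != compute_source_hash(cube_spec)
```

Hashing only the selected fields is what makes a bandwidth-only edit a cache hit: fields outside `HASHED_FIELDS` never reach the digest.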
### D3. Cube NoC mesh auto-layout
`_generate_mesh(cube_spec)`:
#### D3.1. Rows / columns
- `pe_positions = _corner_pe_positions(cube_w, cube_h)`: PE-center
coordinates (mm) per corner (NW/NE/SW/SE). Hardcoded patterns like
`(1.5, 1.5)` and `(cube_w-1.5, cube_h-1.5)`; with `pe_per_corner=2`,
each corner has two PE positions.
- `col_xs = _compute_col_positions(...)`: union of PE x-coordinates,
plus relay columns inserted when any gap exceeds
`max_spacing = 3.0 mm`.
- `row_ys, rows_per_half = _compute_row_positions(cube_h,
n_connections, pe_positions)`:
- `n_conn = max(n_connections, 2)` (hot-path minimum).
- `rows_per_half = ceil(n_conn / 2)`.
- Top half + two HBM rows + bottom half. HBM sits at
`(cube_h/2 - 1.5, cube_h/2 + 1.5)`. The gap between PE rows and
HBM rows is `hbm_gap = 1.5 mm`.
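The column and row arithmetic above might look like the sketch below. The ADR says only that relay columns are inserted when a gap exceeds `max_spacing`; placing them at even intervals is an assumption of this example, and both function names are hypothetical.

```python
import math

MAX_SPACING_MM = 3.0


def compute_col_positions(pe_xs, max_spacing=MAX_SPACING_MM):
    """Sorted unique PE x-coordinates plus evenly spaced relay columns."""
    xs = sorted(set(pe_xs))
    cols = xs[:1]
    for left, right in zip(xs, xs[1:]):
        gap = right - left
        if gap > max_spacing:
            # Split the gap into the fewest segments no wider than max_spacing.
            n_segments = math.ceil(gap / max_spacing)
            step = gap / n_segments
            cols.extend(left + step * i for i in range(1, n_segments))
        cols.append(right)
    return cols


def rows_per_half(n_connections):
    n_conn = max(n_connections, 2)  # minimum of two connections
    return math.ceil(n_conn / 2)
```

For instance, PE columns at 1.5 mm and 10.5 mm leave a 9 mm gap, so two relay columns are inserted between them, while a gap of exactly 3.0 mm needs none.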
#### D3.2. HBM exclusion zone
`hbm_row_start = rows_per_half`,
`hbm_row_end = rows_per_half + 1`.
`hbm_col_start = n_cols // 2 - 1`,
`hbm_col_end = n_cols // 2`.
Router slots inside this (row, col) rectangle are marked `None` (no
router). HBM controllers are added separately as
`hbm_ctrl.pe{X}` nodes following ADR-0017 D9's per-PE partition
pattern.
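As an illustration, the exclusion rectangle can be carved out of a router grid like this. Treating the start/end indices as inclusive bounds is an assumption (it matches the "two HBM rows" in D3.1), and `build_router_grid` is a hypothetical helper, not a function in mesh_gen.py.

```python
def build_router_grid(n_rows, n_cols, rows_per_half):
    """Return a grid of router labels with None inside the HBM zone."""
    hbm_row_start, hbm_row_end = rows_per_half, rows_per_half + 1
    hbm_col_start, hbm_col_end = n_cols // 2 - 1, n_cols // 2
    grid = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            in_hbm = (hbm_row_start <= r <= hbm_row_end
                      and hbm_col_start <= c <= hbm_col_end)
            row.append(None if in_hbm else f"r{r}c{c}")
        grid.append(row)
    return grid


# rows_per_half = 2 gives 2 top rows + 2 HBM rows + 2 bottom rows.
grid = build_router_grid(n_rows=6, n_cols=4, rows_per_half=2)
empty_slots = [(r, c) for r, row in enumerate(grid)
               for c, slot in enumerate(row) if slot is None]
```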
#### D3.3. PE attachment
Each corner's PEs map to a row:
- Top half: NW → row 0, NE → row 1 (top_corners index).
- Bottom half: SW → row `hbm_row_end + 1`, SE → row
`hbm_row_end + 2`.
Each PE's x-coordinate attaches to the nearest column's router
(`min(range(n_cols), key=lambda c: abs(col_xs[c] - pe_x))`).
Attachment items are `pe{pe_idx}.dma`, `pe{pe_idx}.cpu`,
`pe{pe_idx}.hbm` (pushed into the router's attach list).
#### D3.4. M_CPU / SRAM attachment — nearest router by Euclidean distance
For `placement.m_cpu.pos_mm` (default `[1.5, 5.5]`) and
`placement.sram.pos_mm` (default `[1.5, 8.5]`), find the router with
the smallest Euclidean distance and append `"m_cpu"` / `"sram"` to
its attach list.
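Both attachment rules (nearest column for PEs in D3.3, nearest router for M_CPU / SRAM here) reduce to a `min` over candidates. In this sketch the router table shape, `(row, col) -> (x_mm, y_mm)` with empty slots omitted, is an assumption, as is the `nearest_router` name; the column expression follows the one quoted in D3.3.

```python
import math


def nearest_column(col_xs, pe_x):
    """Index of the column whose x-coordinate is closest to pe_x."""
    return min(range(len(col_xs)), key=lambda c: abs(col_xs[c] - pe_x))


def nearest_router(routers, pos_mm):
    """routers maps (row, col) -> (x_mm, y_mm); returns the closest key."""
    px, py = pos_mm
    return min(
        routers,
        key=lambda rc: math.hypot(routers[rc][0] - px, routers[rc][1] - py),
    )
```

Note that `min` returns the first candidate on a tie, so the result depends on iteration order when two routers are equidistant.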
#### D3.5. UCIe N/S/E/W distribution
`ucie_pe_rows = top_pe_rows + bot_pe_rows` (total
`2 * rows_per_half`).
- UCIe-E: one PE row at a time, attach `ucie_e.c{i}` to the rightmost
column's router.
- UCIe-W: attach `ucie_w.c{i}` to the leftmost column's router (E's
mirror).
- UCIe-N/S: split PE columns into left and right halves; attach to
the top row's / bottom row's matching columns.
Each UCIe connection is suffixed `c{i}`, distributing the
`ucie.n_connections` PHYs (ADR-0017 D5+).
### D4. Node naming convention — single ownership
builder.py creates nodes with the following naming convention (the
single-owner principle from ADR-0051 D5):
- `fabric.switch0` — system-level switch.
- `sip{S}.{io_id}.{pcie_ep|io_cpu|io_noc|io_ucie.{dir}|conn.{id}}` —
IO chiplet.
- `sip{S}.cube{C}.{m_cpu|sram|hbm_ctrl.pe{X}|noc.r{R}c{C}|...}` —
inside cube.
- `sip{S}.cube{C}.pe{P}.{pe_cpu|pe_dma|pe_fetch_store|pe_gemm|pe_math|pe_mmu|pe_tcm|pe_scheduler|pe_ipcq}` —
PE sub-components.
Changing this convention requires updating both builder.py and
router.py's helpers (ADR-0051). Components never know the convention
directly — they only call the helpers.
### D5. Edge `kind` classification
Every edge gets a `kind`; routing policy (ADR-0051 D2) reads it. Major
kinds:
- `"pe_internal"` — within a PE between sub-components.
- `"pe_to_router"` — PE_DMA ↔ cube NoC router.
- `"router_mesh"` — between cube NoC routers.
- `"router_to_hbm"`, `"router_to_mcpu"`, `"router_to_sram"`,
`"sram_to_router"`, etc. — between cube-attached components.
- `"ucie_internal"`, `"ucie_conn_to_router"`,
`"router_to_ucie_conn"`, `"ucie_conn_to_noc"`,
`"noc_to_ucie_conn"`, `"ucie_mesh"` — UCIe-related.
- `"io_internal"` — inside IO chiplet.
- `"io_to_cube"`, `"cube_to_io"` — at the IO ↔ cube boundary.
- `"pcie"` — switch ↔ pcie_ep.
- `"command"` — control-plane edges only (e.g., M_CPU ↔ NOC; excluded
from PE DMA paths).
Adding a new edge kind requires picking a category in router.py's
four adjacency graphs (ADR-0051 D2). If you forget, it defaults to
`_adj_all` only, which can produce unintended routes.
### D6. View projection — four abstraction levels
`TopologyGraph` keeps four view projections alongside the flat
nodes+edges:
- **system_view** (`_build_system_view`): Tray level. SIP blocks and
`fabric.switch0`. PCIe links shown. For external high-level
overview.
- **sip_view** (`_build_sip_view`): inside one SIP — cube mesh + IO
chiplet (pcie_ep + io_cpu + io_noc). UCIe N/S/E/W appear as
cube-cube links.
- **cube_view** (`_build_cube_view`): inside one cube — router grid +
PE / M_CPU / SRAM / HBM_CTRL attachments + UCIe PHY edges. For
intra-cube routing / placement debugging.
- **pe_view** (`_build_pe_view`): inside one PE — nine sub-components
+ internal edges (pe_internal kind). For detailed PE-internal
dataflow review.
Views are selectively rendered via the spec's
`visualization.emit_views: [system, sip, cube]` (ADR-0006). The pe
view is omitted from default output but the code is retained for
detailed debugging.
### D7. visualizer.py — SVG diagram output
`emit_diagrams(graph, out_dir)` renders every view as SVG. Key
functions:
- `_render_view_svg(view)` — generic view render (no router grid).
- `_render_cube_view_svg(view, spec)` — cube-view specific (HBM block,
router grid layout, PE/M_CPU/SRAM/HBM placement).
- `_draw_node`, `_draw_edge` — node/edge visual representation.
- `_pick_scale`, `_compute_node_sizes` — auto-scaling.
The visualizer's output is a **derived artifact** (ADR-0006); changes
here are not subject to production checks. This aligns with CLAUDE.md's
"Derived Artifacts" guidance.
### D8. Blast radius of spec changes
| spec field | effect | mesh regenerated? |
|---------------------------------------|---------------------|-------------------|
| `system.sips.count` | SIP count, node count | No |
| `sip.cube_mesh.w/h` | cube mesh shape | No |
| `cube.geometry.cube_mm.w/h` | cube size (mm) | **Yes** |
| `cube.pe_layout.corners/pe_per_corner`| PE attachment positions | **Yes** |
| `cube.ucie.n_connections` | UCIe PHY distribution | **Yes** |
| `cube.memory_map.hbm_mapping_mode` | HBM distribution mode | **Yes** |
| `cube.placement` | M_CPU/SRAM positions | **Yes** |
| `cube.memory_map.*` (besides above) | HBM capacity / BW | No |
| `*.links.*.bw_gbs` | edge bandwidth | No |
| `*.attrs.overhead_ns` | component latency | No |
The table mirrors D2's `_compute_source_hash` inputs. Changes that
require mesh regeneration automatically invalidate `cube_mesh.yaml`'s
source_hash.
## Alternatives Considered
### A1. Regenerate the mesh on every compile without a cache file
Rejected. The cost of mesh generation would be paid repeatedly (CLI
runs, probe, tests) for the same spec, and the human-inspectable
artifact would disappear.
### A2. Merge mesh generation into builder.py
Rejected (currently). It is a 305-line algorithm of its own, and the
mesh-layout decisions (placement-driven router attachment, HBM
exclusion zone) are different from builder's general node/edge
emission. Keeping it separate respects single-responsibility.
### A3. Express placement coordinates in cube coordinates (col/row)
Rejected. mm coordinates flow consistently between the visualizer and
mesh layout (for nearest-router computation). Cube coordinates are
undefined until the router grid is fixed, so they are unsuitable as
placement input.
### A4. Lazy view projection generation
Rejected (currently). The four views are cheap to build (typically <
100 ms), and eager construction guarantees `TopologyGraph` as the
single source of truth.
### A5. Visualizer output in formats besides SVG (PNG/PDF)
Rejected. SVG is vector + text-searchable + directly renderable in
browsers. PNG conversion, when required, is downstream
post-processing (e.g., rsvg-convert).
## Consequences
- ADR-0006's high-level intent is fleshed out via D1–D7; topology
changes can be assessed quickly via D8's table.
- D3's mesh-layout algorithm is ADR-locked, so future PE attachment
patterns (e.g., a 6-zone HBM split) make clear which stage they
affect.
- D5's edge-kind list and D7's view structure are explicit, giving PR
reviewers a quick map of where (builder + router + visualizer) a
new component type ripples through.
- D2's source_hash invalidation rules are explicit, so a stale
`cube_mesh.yaml` (e.g., when only bandwidth changed) is recognized
as correct behavior.
@@ -0,0 +1,143 @@
# ADR-0054: Milestone Eval Benches — self-contained sweep + figure benches
## Status
Accepted (2026-05-22).
Amends ADR-0044 (D1/D2) and ADR-0045 (D5) and supersedes the "logic lives
in `scripts/` + `tests/`" arrangement of ADR-0043/0044: the GEMM and
allreduce evaluation harnesses are now self-contained **benches** that a
user runs to regenerate every result + figure.
## Context
ADR-0043 (allreduce eval) and ADR-0044 (GEMM eval) split each harness into
a **sweep** (a manual `scripts/` driver, or — for allreduce — the
parametrized tests themselves) plus **figure tests** that render committed
data. The sweep/render logic therefore lived under `scripts/gemm_sweep.py`,
`tests/gemm/_gemm_plot_helpers.py`, and `tests/sccl/_allreduce_helpers.py`.
A milestone requirement ("refactor allreduce + GEMM evaluation so a user
can run *one bench* to generate all the results and plots") cannot be met
by that layout: a bench is production code and **must not import from
`tests/`** (ADR-0007 layer direction). The eval logic had to move into
production, reachable from a bench.
The chosen home is the bench module itself — not a separate
`kernbench.eval` package. A bench file may contain arbitrary module-level
code; collapsing the harness into the bench keeps one file per domain and
avoids an extra package layer.
## Decision
### D1. Two milestone benches own the eval logic
- `src/kernbench/benches/milestone_1h_gemm.py` — GEMM shape×variant sweep +
the three figure renderers (moved from `scripts/gemm_sweep.py` +
`tests/gemm/_gemm_plot_helpers.py`).
- `src/kernbench/benches/milestone_1h_ccl.py` — the distributed allreduce
driver, latency + buffer-kind sweeps, topology diagram, FSIM comparison,
and the direct-launch parity reference (moved from
`tests/sccl/_allreduce_helpers.py`).
Each file is the **single home** for its domain's eval logic.
### D2. The "eval bench" pattern (extends ADR-0045 D5)
ADR-0045 D5 fixed a bench to a single configuration (single-SIP, or the
ADR-0024 multi-SIP CCL exception). This ADR adds a third pattern:
- An **eval bench** may drive *many* configurations and render figures. It
builds its own per-config `GraphEngine` / `RuntimeContext` instances
(one per sweep point) rather than using the outer `run_bench` engine.
- Because the outer ctx then has no submitted handles, the bench submits a
**sentinel tensor** (`torch.zeros((1, 1), …)`) at the end to satisfy
`run_bench`'s "must submit at least one request" contract (ADR-0045 D4),
so the CLI exits 0.
### D3. Output location
Both benches write to `src/kernbench/benches/1H_milestone_output/{gemm,ccl}/`
(per user request — artifacts beside the bench). The directory holds only
generated PNG/CSV/JSON (never a `.py`/`__init__.py`), so the eager-import
audit (ADR-0045 first action) ignores it — `pkgutil.iter_modules` does not
yield non-package subdirectories. It is **committed** (like the
`docs/diagrams/` artifacts) so the figures are viewable on the remote;
rerunning the bench regenerates it in place.
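
The `pkgutil.iter_modules` behavior that D3 relies on can be checked in isolation. The sketch below builds a throwaway directory that imitates the layout and lists what `pkgutil.iter_modules` reports. The helper names and the `roofline.png` file name are illustrative assumptions, not part of the repository.

```python
import pkgutil
import tempfile
from pathlib import Path


def discovered_modules(package_dir: Path) -> list[str]:
    """Return the module names pkgutil.iter_modules reports for a directory."""
    return sorted(info.name for info in pkgutil.iter_modules([str(package_dir)]))


def build_fake_benches(root: Path) -> Path:
    """Create a stand-in benches/ directory (layout assumed for illustration)."""
    benches = root / "benches"
    benches.mkdir()
    (benches / "__init__.py").write_text("")
    (benches / "milestone_1h_gemm.py").write_text("")
    # Output directory: generated artifacts only, no __init__.py anywhere.
    out = benches / "1H_milestone_output" / "gemm"
    out.mkdir(parents=True)
    (out / "roofline.png").write_bytes(b"")  # hypothetical artifact name
    return benches


with tempfile.TemporaryDirectory() as tmp:
    names = discovered_modules(build_fake_benches(Path(tmp)))
    print(names)
```

Only the bench module is reported; the output directory is skipped because it contains no `__init__.py`.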
### D4. GEMM heavy sweep — fresh by default, `MILESTONE_FAST` to reuse
`milestone-1h-gemm` runs the full 24-sim sweep by default (minutes; one
shape is 2048 tiles). `MILESTONE_FAST=1` reuses the committed
`docs/diagrams/gemm_sweep.json` and only re-renders (seconds). This
reverses ADR-0044 D1/D2's "heavy sweep stays a manual/`slow`-marked step":
running the bench *is* the regeneration. In the test suite, the slow path
is exercised by a `@pytest.mark.slow` bench test, while the fast-path test
runs in the default (non-slow) selection.
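
A minimal sketch of the fresh-by-default / `MILESTONE_FAST` switch, under stated assumptions: the function name `load_or_run_sweep`, its parameters, and the fallback to a fresh sweep when the cached file is missing are all hypothetical and are not taken from the bench source.

```python
import json
import os
from pathlib import Path
from typing import Callable, Mapping


def load_or_run_sweep(
    cached_json: Path,
    run_full_sweep: Callable[[], list],
    env: Mapping[str, str] = os.environ,
) -> tuple[list, str]:
    """Return (rows, mode); reuse the cached JSON only when MILESTONE_FAST=1."""
    if env.get("MILESTONE_FAST") == "1" and cached_json.exists():
        # Fast path: re-render from committed data, no simulation.
        return json.loads(cached_json.read_text()), "reused"
    # Default path: run the heavy sweep and refresh the cached data.
    rows = run_full_sweep()
    cached_json.write_text(json.dumps(rows))
    return rows, "fresh"
```

Passing `env` explicitly keeps the decision testable without mutating the process environment.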
### D5. Tests + script reuse via thin re-export shims (single home kept)
The pre-existing figure tests and the `scripts/gemm_sweep.py` entry point
are retained and now reuse the bench modules:
- `tests/gemm/_gemm_plot_helpers.py` → re-exports the renderers +
`GEMM_SWEEP_JSON`/`GEMM_PLOTS_DIR`/`ROOT` from
`kernbench.benches.milestone_1h_gemm`.
- `tests/sccl/_allreduce_helpers.py` → re-exports the driver core, config
writers, sweep constants, renderers, and disk aggregators from
`kernbench.benches.milestone_1h_ccl`, and keeps the **pytest-only** pieces
local: the `pytest.param` matrices (`CONFIGS` / `_sweep_params` /
`_bk_params`) and the fixture-coupled `_run_distributed`
(`monkeypatch.chdir` + `_drive_distributed`) wrapper.
- `scripts/gemm_sweep.py` → thin wrapper over the bench's `run_sweep`.
Tests importing a bench module is permitted (tests sit above production,
ADR-0007); it triggers the whole-package eager audit, which already runs on
every `kernbench` invocation. matplotlib stays lazily imported inside the
renderers, so the audit's startup cost is unchanged.
### D6. Flat module naming (no `benches/` subfolder)
A `benches/` subpackage named `1H_milestone…` is impossible — a Python
package name cannot start with a digit. The benches are therefore flat
modules `milestone_1h_gemm.py` / `milestone_1h_ccl.py` with bench names
`milestone-1h-gemm` / `milestone-1h-ccl` (kebab-case, letter-first per
ADR-0045 D1).
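
The naming constraint can be illustrated with the standard library alone. In this sketch the regular expression is an assumed reading of "kebab-case, letter-first", and the underscore-to-hyphen mapping in `bench_name_for` is a hypothetical convention; neither is quoted from ADR-0045.

```python
import re

# Assumed pattern for "kebab-case, letter-first" bench names (illustrative).
BENCH_NAME_RE = re.compile(r"[a-z][a-z0-9]*(-[a-z0-9]+)*")


def is_importable_name(name: str) -> bool:
    """A module or package name must be a valid Python identifier."""
    return name.isidentifier()


def bench_name_for(module_name: str) -> str:
    """Derive a kebab-case bench name from a flat module name (assumed mapping)."""
    return module_name.replace("_", "-")
```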
## Consequences
### Positive
- `kernbench run --bench milestone-1h-gemm` (or `…-ccl`) regenerates all of
a domain's results + figures in one command — the milestone requirement.
- Single source for the eval logic (the bench), reused by tests and the
script via shims; no duplication.
- The figure tests and `scripts/gemm_sweep.py` keep working unchanged.
### Negative / limitations
- The two bench files are large (the CCL one mixes the distributed driver,
sweeps, and matplotlib drawing). A "bench" that is mostly an eval harness
is unusual; this ADR legitimizes it.
- Generated artifacts live inside the source tree (`src/kernbench/benches/`)
by explicit request and are committed (so the figures are viewable on the
remote); rerunning the bench regenerates them.
- `milestone-1h-ccl` (and the default `milestone-1h-gemm`) take minutes —
acceptable for an on-demand milestone artifact, not for routine runs.
## Dependencies
- **ADR-0007**: layer direction (why tests may import production but a bench
may not import tests).
- **ADR-0043 / ADR-0044**: the allreduce / GEMM eval harnesses this ADR
relocates into benches.
- **ADR-0045**: bench module contract; D2 here extends its D5 (single-device
rule) with the eval-bench pattern, and relies on D4 (NO_REQUESTS) for the
sentinel.
- **ADR-0024**: rank = SIP launcher driven by the allreduce sweeps.
## Open questions
- Should the GEMM theoretical-model constants (ADR-0044 D5) be sourced from
ADR-0033/0014 rather than copied? Unchanged by this ADR.
- Should `build_overview_slides.py` consume the milestone output PNGs
instead of drawing GEMM bars natively? Still open (ADR-0044 D6 / Negative).
@@ -0,0 +1,175 @@
# ADR Index
Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **47**.
Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.
## Design Principles
- [ADR-0013](./ADR-0013-ver-verification-strategy.md) — Verification Strategy and Phase 1 Test Plan
- [ADR-0033](./ADR-0033-lat-latency-model-assumptions.md) — Latency Model: Assumptions and Known Simplifications
## High-level Architecture
- [ADR-0003](./ADR-0003-dev-target-system-hierarchy.md) — Target System Hierarchy & Modeling Scope _(System hierarchy (Tray / SIP / CUBE / PE))_
- [ADR-0007](./ADR-0007-api-runtime-api-boundaries.md) — Runtime API and Simulation Engine Boundaries _(Runtime API ↔ sim_engine boundaries)_
- [ADR-0016](./ADR-0016-dev-iochiplet-noc-and-memory-path.md) — IOChiplet NOC and Memory Data Path _(IOChiplet NOC and memory data path)_
- [ADR-0017](./ADR-0017-dev-cube-noc-and-hbm-connectivity.md) — Cube NOC and HBM Connectivity _(Cube NOC and HBM connectivity)_
## Detailed Architecture
One subsection per component file under `src/kernbench/components/builtin/`.
### forwarding
- [ADR-0037](./ADR-0037-dev-forwarding-component.md) — Forwarding Component (forwarding_v1)
### hbm_ctrl
- [ADR-0034](./ADR-0034-dev-hbm-controller-internal-design.md) — HBM Controller Internal Design
### io_cpu
- [ADR-0036](./ADR-0036-dev-io-cpu-component-model.md) — IO_CPU Component Model
### m_cpu
- [ADR-0035](./ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md) — M_CPU and M_CPU.DMA Component Model
### pcie_ep
- [ADR-0038](./ADR-0038-dev-pcie-ep-component-model.md) — PCIE_EP Component Model
### pe_cpu
- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
### pe_dma
- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
### pe_fetch_store
- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
### pe_gemm
- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
### pe_ipcq
- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
### pe_math
- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
### pe_mmu
- [ADR-0039](./ADR-0039-dev-pe-mmu-component-model.md) — PE_MMU Component Model — Component + Utility Dual Role
### pe_scheduler
- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
### pe_tcm
- [ADR-0040](./ADR-0040-dev-pe-tcm-component-model.md) — PE_TCM Component Model — Dual-Channel BW Serialization
### sram
- [ADR-0041](./ADR-0041-dev-cube-sram-component-model.md) — Cube SRAM Component Model — terminal scratchpad on cube NoC
### tiling
- [ADR-0042](./ADR-0042-prog-tile-plan-generators.md) — Tile Plan Generators — GEMM/Math Pipeline Plan Builders
## Implementation Decisions
### Address Scheme
- [ADR-0001](./ADR-0001-mem-physaddr-layout.md) — 51-bit Physical Address Layout & Decoding Contract
- [ADR-0011](./ADR-0011-mem-memory-addressing-simplification.md) — Memory Addressing — PA / VA / LA Address Models
### Routing & Helper API
- [ADR-0002](./ADR-0002-lat-routing-distance.md) — Routing Distance, Ordering & Bypass Rules
- [ADR-0051](./ADR-0051-lat-routing-helper-api.md) — Routing Helper API — `AddressResolver` + `PathRouter`
### Memory Semantics & Local-HBM Bandwidth
- [ADR-0004](./ADR-0004-mem-memory-semantics-local-hbm.md) — Memory Semantics & Local-HBM Bandwidth Guarantee
### Topology Compilation, Diagrams & Builder Algorithms
- [ADR-0005](./ADR-0005-dev-diagram-views-distance-layout.md) — Diagram Views & Distance-Aware Layout Rules
- [ADR-0006](./ADR-0006-dev-topology-compilation-distance-diagram.md) — Topology Compilation, Distance Extraction, and Automatic Diagram Generation
- [ADR-0053](./ADR-0053-dev-topology-builder-algorithms.md) — Topology Builder + Visualizer Algorithms
### Tensor Deployment and Allocation
- [ADR-0008](./ADR-0008-api-tensor-deploy-and-allocation.md) — Tensor Deployment and Allocation (Host Allocator, PA-first)
### Kernel Execution and Host-Device Messaging
- [ADR-0009](./ADR-0009-api-kernel-execution-messaging.md) — Kernel Execution Messaging and Completion Semantics
- [ADR-0012](./ADR-0012-api-host-io-message-schema.md) — Host ↔ IO_CPU Message Schema (PA-first, PE-tagged)
### CLI Surface and Semantics
- [ADR-0010](./ADR-0010-api-cli-surface-and-semantics.md) — Command Line Interface and Execution Semantics
### Component Port/Wire Fabric Model
- [ADR-0015](./ADR-0015-dev-component-port-wire-model.md) — Component Port/Wire Model and Fabric Routing
### Two-Pass Data Execution
- [ADR-0020](./ADR-0020-prog-data-execution-two-pass.md) — 2-Pass Data Execution Model (Timing / Data Separation)
### 2D Grid Program Identity
- [ADR-0022](./ADR-0022-prog-program-id-2d-grid.md) — 2D Grid program_id Semantics
### Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)
- [ADR-0024](./ADR-0024-par-sip-tp-launcher.md) — SIP-level Launcher — rank = SIP
- [ADR-0026](./ADR-0026-par-dppolicy-intra-device.md) — DPPolicy = Intra-Device Only — remove sip/num_sips fields
- [ADR-0027](./ADR-0027-par-megatron-tp.md) — Megatron-style Tensor Parallelism API
- [ADR-0047](./ADR-0047-par-ahbm-ccl-backend.md) — AHBM CCL Backend — `torch.distributed`-compat shim
- [ADR-0050](./ADR-0050-par-ccl-algorithm-module-contract.md) — CCL Algorithm Module Contract — `ccl/algorithms/*.py`
### IPCQ Direction Addressing
- [ADR-0025](./ADR-0025-algo-ipcq-direction-addressing.md) — IPCQ Direction Addressing — address-based matching
### Intercube All-Reduce
- [ADR-0032](./ADR-0032-algo-intercube-allreduce.md) — Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
### Evaluation Harnesses
- [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce Evaluation Harness — `tests/sccl/`
- [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM Evaluation Harness — `scripts/gemm_sweep.py` + `tests/gemm/`
- [ADR-0054](./ADR-0054-eval-milestone-benches.md) — Milestone Eval Benches — self-contained sweep + figure benches
### Bench Module Contract
- [ADR-0045](./ADR-0045-prog-bench-module-contract.md) — Bench Module Contract — registration, dispatch, and authoring
### Kernel-side tl.* API (TLContext)
- [ADR-0046](./ADR-0046-prog-tl-context-contract.md) — TLContext — Kernel-side `tl.*` API Contract
### Memory Allocator Algorithms
- [ADR-0048](./ADR-0048-mem-allocator-algorithms.md) — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
### Probe Subcommand
- [ADR-0049](./ADR-0049-ver-probe-subcommand.md) — `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
### Sim-engine Op Log and Memory Store Schemas
- [ADR-0052](./ADR-0052-dev-oplog-memory-store-schemas.md) — OpLog + MemoryStore Schemas — sim_engine internals
@@ -0,0 +1,836 @@
# KernBench — Architecture Design Document
*2026 1H*
KernBench is a system-level, discrete-event simulator for AI-accelerator
chiplet systems. It models the data-movement and control paths across
the full hardware hierarchy and reports end-to-end execution latency
for kernels dispatched to the device's compute units.
This document is a public summary of the architecture as designed and
implemented in the first half of 2026. It assumes no prior knowledge of
the simulator's internal documents; terms specific to the system are
defined on first use.
---
## Design Principles
KernBench is grounded in two foundational commitments: every measured
latency must trace to explicit, modeled events on the simulator's graph,
and every behavioral claim must be verifiable through tests that target
spec-level invariants rather than incidental implementation details.
<!-- src: ADR-0013 Context, Decision -->
Verification is driven by contracts rather than by implementation
structure. Tests are written to validate the architectural contracts
that the simulator exposes (correct routing, deterministic results,
monotonic latency under increasing hop counts) rather than to mirror
the call graph of the implementation. Two phases coexist: a fast timing phase that exercises
the simulator's discrete-event engine and produces a log of operations
with timestamps, and an optional data-replay phase that uses that log
to compute real numerical results. Tests can target either phase.
<!-- src: ADR-0033 Context, Decision -->
The latency model is intentionally abstract rather than
cycle-accurate. Each modeled node contributes a configurable per-node
overhead, each link contributes wire delay plus byte-over-bandwidth
serialization, and each terminal service contributes its own service
time. The simulator does not attempt to reproduce cache coherence
protocols, microarchitectural pipelines, or full PCIe/UCIe protocol
correctness; those are explicitly outside the scope. The aim is a
simulator that compares system-level configurations meaningfully and
deterministically, not one that reproduces microarchitectural behavior.
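
As a worked illustration of the three contributions named above, the sketch below sums per-node overhead, per-link wire delay plus bytes-over-bandwidth serialization, and terminal service time. The `Link` type, the function name, and all numeric values are assumptions chosen for the example; they are not simulator parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    wire_delay_ns: float
    bandwidth_bytes_per_ns: float  # 1 byte/ns equals 1 GB/s


def path_latency_ns(
    node_overheads_ns: list[float],
    links: list[Link],
    size_bytes: int,
    terminal_service_ns: float,
) -> float:
    """Sum node overheads, link delay + serialization, and terminal service."""
    node_term = sum(node_overheads_ns)
    link_term = sum(
        link.wire_delay_ns + size_bytes / link.bandwidth_bytes_per_ns
        for link in links
    )
    return node_term + link_term + terminal_service_ns


# Made-up example: three nodes, two links, a 256-byte request.
example_ns = path_latency_ns(
    node_overheads_ns=[2.0, 2.0, 2.0],
    links=[Link(1.0, 64.0), Link(1.0, 64.0)],
    size_bytes=256,
    terminal_service_ns=20.0,
)
```

With these numbers the node term is 6 ns, each link adds 1 + 256/64 = 5 ns, and the terminal adds 20 ns.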
<!-- src: ADR-0033 Decision, Consequences -->
Determinism is a hard requirement. Given identical inputs — topology,
routing policy, and request stream — the simulator must produce
identical outputs, hop traces included. This rules out reliance on
unordered set iteration on the critical path and forces every latency
contribution to come from an explicitly scheduled event on a modeled
component or link. There are no implicit waits, no hardcoded magic
delays, and no shortcuts that bypass the modeled graph.
---
## High-level Architecture
<!-- src: ADR-0003 Context, Decision -->
The simulated system is a four-level hierarchy. A **Tray** holds one or
more **SIPs** (system-in-package), each containing a 2D mesh of
**CUBEs** plus one or more **IO chiplets** that connect the SIP to the
host. Each CUBE contains a regular grid of **PEs** (processing
elements) plus its own attached resources — high-bandwidth memory
(HBM), a per-cube SRAM scratchpad, and a management CPU (M_CPU). The PE
itself is a composite of nine sub-components rather than a monolithic
core. This hierarchy is fixed; the parameters along each axis (counts,
mesh dimensions, link widths) are configurable through the topology
spec.
<!-- src: ADR-0007 Context, Decision -->
A clean separation runs along the request flow. A **runtime API** at
the top is the host-facing surface; it exposes tensor and kernel
operations, owns host-side allocation metadata, and is topology-
agnostic — it does not route or fan out. Below it the **simulation
engine** decomposes runtime operations into discrete graph requests
(memory writes, memory reads, kernel launches, MMU map installs) and
schedules events deterministically. At the bottom, **components** model
device behavior on a graph of nodes connected by links; they
implement the actual latency contributions and pass requests along.
No component reaches up into the runtime API, and no runtime call
shortcuts the engine.
<!-- DIAGRAM: Four-level system hierarchy — Tray containing SIPs, each SIP showing its 2D cube mesh and IO chiplet; one cube blown up to show the router mesh, attached PEs, M_CPU, SRAM, and HBM partition. -->
### Tray
<!-- src: ADR-0003 Decision -->
The Tray is the outermost boundary. It owns the host CPU on one side
and one or more SIPs on the other, connected through a fabric switch.
For collective communication that must traverse multiple SIPs, the
fabric switch acts as the common rendezvous: device-side outbound
traffic from one SIP routes through the switch and back into the
target SIP's IO chiplet.
### SIP
<!-- src: ADR-0003 Decision, ADR-0017 Context -->
A SIP packages a 2D mesh of CUBEs and one or more IO chiplets. The
default topology used by the simulator is a 4×4 cube mesh; the
mesh dimensions are configurable. Each cube connects to its mesh
neighbors over UCIe (die-to-die) links arranged on its four cardinal
sides: north, south, east, and west. The IO
chiplets sit on one side of the SIP and provide the bridge to the host
across PCIe.
<!-- src: ADR-0016 Context, Decision -->
The IO chiplet itself contains its own internal network. A
host-facing PCIe endpoint passes traffic to a small NOC ("network on
chip"); from there it can branch to a control-plane CPU that processes
kernel-launch messages, or it can take the direct memory data path to
the cube's HBM controller. The decision to provide a direct memory
path that bypasses the control CPU was a deliberate concession to
keep host-issued memory writes from paying control-plane overhead on
the data path.
### CUBE
<!-- src: ADR-0017 Decision -->
Each CUBE owns a 2D mesh of NOC routers and a set of attached
resources: PEs, the cube-local SRAM scratchpad, the management CPU
(M_CPU), and the HBM partition (split across multiple PE-private
slices for bandwidth). The router mesh uses deterministic XY routing.
Attached components do not connect to each other directly — they all
sit on the router mesh, and every cube-internal transfer pays the
mesh distance from source to destination.
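
Deterministic XY routing can be sketched in a few lines: move along X until the column matches, then along Y. The `(x, y)` coordinate convention and the function name are assumptions for illustration; the simulator's own router naming is not reproduced here.

```python
def xy_route(src: tuple[int, int], dst: tuple[int, int]) -> list[tuple[int, int]]:
    """Router coordinates visited from src to dst: X first, then Y."""
    x, y = src
    path = [(x, y)]
    step_x = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step_x
        path.append((x, y))
    step_y = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step_y
        path.append((x, y))
    return path
```

The hop count is always the Manhattan distance between the two routers, and the same source and destination always yield the same path.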
<!-- src: ADR-0017 Decision -->
The HBM partition is per-PE: each PE owns one HBM slice, and the
controller exposes per-PE channels so that the same PE always
addresses the same set of HBM channels. This makes the local-HBM
bandwidth from a PE to its own slice predictable, while accesses to
another PE's slice — or a different cube's slice — pay the mesh
distance and any UCIe crossings.
### PE
<!-- src: ADR-0014 Context, Decision -->
A PE is not a monolithic core. Internally it is a set of nine
sub-components, each modeling one stage of a request's flow: a small
control CPU, a tile-pipeline scheduler, a DMA engine, a fetch-store
engine that moves data between the on-PE scratchpad and the register
file, a GEMM compute engine, a math compute engine, the tightly-
coupled memory (TCM, the on-PE scratchpad), an MMU for virtual-to-
physical address translation, and an inter-PE collective queue
(IPCQ). The scheduler decomposes higher-level operations into per-tile
stage sequences, and tile tokens self-route from one sub-component
to the next.
<!-- DIAGRAM: PE internal layout — the nine sub-components and the edges that connect them; tile token flowing through DMA_READ → FETCH → GEMM → STORE → DMA_WRITE. -->
---
## Detailed Architecture
This section describes each modeled device-side component in turn.
Components are listed in the alphabetical order used by the
simulator's source tree.
### forwarding
<!-- src: ADR-0037 Context, Decision -->
The forwarding component is the generic routing relay used wherever a
node only needs to apply a small processing overhead and pass the
request to the next hop. NOC routers, conn nodes, and UCIe PHYs all
reduce to this. Its first act on receiving a request is to apply the
per-node overhead configured for it in the topology spec; after the
overhead it simply hands the request to the next hop along the path.
<!-- src: ADR-0037 Decision, Consequences -->
The decision to share one implementation across these roles was made
to keep the simulator's component set small without sacrificing
modeling fidelity. Each instance still carries its own overhead and
its own link bandwidth contributions, so different roles still produce
different timing. What is shared is the dispatcher loop, not the
parameter values.
### hbm_ctrl
<!-- src: ADR-0034 Context, Decision -->
The HBM controller is the terminal node for all memory traffic that
reaches HBM. Internally it owns a number of pseudo channels, partitioned
per-PE so that each PE addresses a deterministic subset. On a request
arrival the controller first selects the right pseudo channel from the
target address, then enters a chunk-loop that drains the requested
size in fixed-size flits over the channel's bandwidth.
<!-- src: ADR-0034 Decision, Consequences -->
The chunk-loop pattern replaces an earlier all-at-once drain. The
benefit is that the controller no longer presents a flit-aware fabric
with a single bulk transfer; instead it emits flits at a paced rate
matching the channel bandwidth, which makes cross-flow contention
visible. The bandwidth budget is calibrated against the configured
HBM total bandwidth divided across the channel count.
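
The chunk-loop can be sketched as a paced drain: each flit completes one flit-time after the previous one, at a per-channel bandwidth equal to the total divided by the channel count. The function name, parameters, and numbers below are illustrative assumptions, not the controller's configuration.

```python
def chunk_drain_schedule(
    size_bytes: int,
    flit_bytes: int,
    total_bw_bytes_per_ns: float,
    num_channels: int,
    start_ns: float = 0.0,
) -> list[tuple[float, int]]:
    """Return (completion_time_ns, bytes) for each flit drained on one channel."""
    channel_bw = total_bw_bytes_per_ns / num_channels
    schedule = []
    t = start_ns
    remaining = size_bytes
    while remaining > 0:
        chunk = min(flit_bytes, remaining)  # the last flit may be partial
        t += chunk / channel_bw
        schedule.append((t, chunk))
        remaining -= chunk
    return schedule


# Made-up numbers: 160 bytes, 64-byte flits, 256 B/ns split over 8 channels.
example_schedule = chunk_drain_schedule(160, 64, 256.0, 8)
```

Because the flits complete at distinct times rather than all at once, another flow using the same channel can interleave between them, which is the contention the paragraph above describes.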
### io_cpu
<!-- src: ADR-0036 Context, Decision -->
The IO_CPU is the control-plane processor sitting inside the IO chiplet.
It receives kernel-launch messages from the host, decodes them, and
dispatches per-cube launches to the cube's management CPU. Pure memory
operations bypass it entirely, taking the direct data path established
inside the IO chiplet.
<!-- src: ADR-0036 Decision -->
On receiving a kernel-launch message, the IO_CPU consults the message's
shard list — which already names the target SIP, cube, and PE for each
piece of the tensor argument — and forwards a per-cube launch to each
cube the kernel needs to reach. This makes the IO_CPU a deterministic
fan-out point: it does not decode physical addresses to route, it just
follows the explicit per-shard targets it was handed.
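
A minimal sketch of the fan-out, assuming a shard is just a `(sip, cube, pe)` target: shards are grouped per cube and emitted in sorted order so the result does not depend on input order. The `Shard` type and `fan_out` function are hypothetical names, not the IO_CPU's message schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    sip: int
    cube: int
    pe: int


def fan_out(shards: list[Shard]) -> dict[tuple[int, int], list[int]]:
    """Group shard targets into one launch per (sip, cube), in sorted order."""
    per_cube: dict[tuple[int, int], set[int]] = defaultdict(set)
    for shard in shards:
        per_cube[(shard.sip, shard.cube)].add(shard.pe)
    # Sorting keys and PE lists keeps the fan-out order deterministic.
    return {key: sorted(per_cube[key]) for key in sorted(per_cube)}
```

No address decoding takes place: the targets come straight from the shard list.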
### m_cpu
<!-- src: ADR-0035 Context, Decision -->
The M_CPU is the cube's management processor. It owns two distinct
roles: as a control-plane fan-out point for kernel launches arriving
from the IO chiplet, and as a DMA endpoint for host-initiated memory
writes that need to land in this cube's HBM. The control role
forwards launches to the right PE control CPUs; the DMA role places
the actual bytes into HBM through the router mesh.
<!-- src: ADR-0035 Decision -->
The component model deliberately distinguishes the two roles because
their routing differs: the control fan-out path uses command-kind
links that do not appear on data-path routes, while the DMA path uses
the same router mesh as PE-initiated DMA, with PE-internal nodes
excluded. The routing layer knows about both modes and selects the
appropriate adjacency at request time.
### pcie_ep
<!-- src: ADR-0038 Context, Decision -->
The PCIe endpoint is the protocol boundary at the host-device edge.
Its first act on each incoming request is to apply a configured
protocol-processing overhead; after that it simply forwards. There is
no internal queuing model, no retry, and no TLP-level fidelity — those
are deliberately outside scope. The endpoint is bidirectional: host →
device traffic (memory writes, kernel launches) flows one way, and
device-side outbound traffic (cross-SIP collective sends) flows the
other.
<!-- src: ADR-0038 Decision, Alternatives Considered, Consequences -->
A more detailed PCIe model was considered and rejected. The simulator
is targeting system-level latency comparisons; making the endpoint
heavier with credit-management and retry logic would not improve the
metrics being studied. The decision keeps the endpoint as the
documented protocol-boundary node, named consistently so routing
helpers can locate it by SIP and IO instance.
### pe_cpu
<!-- src: ADR-0014 Decision -->
The PE control CPU is the entry point for kernel work arriving from
the cube's management CPU. It receives kernel-launch messages, resolves
the kernel function by name, and hands execution to the scheduler with
the resolved tensor arguments. From the scheduler's point of view, the
PE_CPU is the upstream source of high-level commands; from the rest
of the system's point of view, the PE_CPU is where a kernel's
execution begins on a given PE.
### pe_dma
<!-- src: ADR-0014 Decision, ADR-0023 Decision -->
The DMA engine on each PE has two distinct modes. In the standard PE
pipeline it consumes tile tokens issued by the scheduler, acquires a
read or write channel (modeled as a one-in-flight resource per
direction), and runs the bytes to or from HBM through the mesh. In
its collective mode it forwards send tokens for the cube's IPCQ into
the fabric, snapshotting the source data at send time so later
mutations cannot race the receiver's read. Both modes share the same
channel resources but differ in their downstream handling — one
returns when the round-trip completes, the other dispatches
fire-and-forget.
### pe_fetch_store
<!-- src: ADR-0014 Decision -->
The fetch-store engine is the bridge between the on-PE scratchpad
(TCM) and the register file. It does not run DMA; it only moves bytes
internally. On receiving a tile-stage token it sends a short request
to the TCM, waits for the bandwidth-serialized delay, and continues
the pipeline. The split between this engine and the TCM lets the
scratchpad model its own read/write bandwidth independently.
### pe_gemm
<!-- src: ADR-0014 Decision -->
The GEMM engine is the matrix-multiply compute unit. Tile tokens
arriving at this stage carry the per-tile dimensions, and the engine
contributes a service time that accounts for the fused multiply-adds
(MACs) the tile requires. Composite operations (where the same tensor pair is
streamed across many tiles) reuse the engine through the scheduler;
the engine itself is stateless between tiles.
### pe_ipcq
<!-- src: ADR-0023 Context, Decision -->
The IPCQ — inter-process communication queue — is each PE's
collective-communication endpoint. It owns ring buffers that hold
inbound messages from neighbor PEs and bookkeeping for send credits.
Direction names ("N", "S", "E", "W" for cube-internal neighbors and
"global_*" for cross-SIP neighbors) are resolved to physical peer
endpoints by a neighbor table installed at process-group creation
time. The component itself does not move bytes — it issues DMA tokens
through the local PE_DMA, which performs the actual cross-PE
transfer.
<!-- src: ADR-0023 Decision, Consequences -->
A key invariant is that the inbound terminal — where data lands at
the receiver — pays the link bandwidth drain plus any cube-internal
mesh hop to the slot's backing memory. This prevents IPCQ from
silently outpacing raw DMA at large transfer sizes. Outbound sends
are fire-and-forget; credit return is the only backpressure signal.
### pe_math
<!-- src: ADR-0014 Decision -->
The math engine handles element-wise and reduction operations. It
consumes tile tokens carrying an operation kind (`exp`, `sum`, `max`,
`where`, etc.) and contributes a service time proportional to the
number of elements processed. Like the GEMM engine it is stateless;
chained epilogues (a sequence of math operations after a GEMM tile)
are scheduled as separate stages.
### pe_mmu
<!-- src: ADR-0039 Context, Decision -->
The MMU has two roles, exposed through one component. As a node on
the cube NOC it receives MMU-map and MMU-unmap messages and updates
its internal page table, so that the runtime API can install
virtual-to-physical mappings with measured fabric latency. As a
utility object held inside the PE it offers synchronous translate
calls to the PE's DMA and GEMM engines without taking simulator time
itself; the calling engine pays any configured TLB overhead in its
own process.
<!-- src: ADR-0039 Decision, Alternatives Considered -->
The page table supports multiple disjoint regions inside a single
page, with later-write-wins semantics on overlap. This is a deliberate
simulator stopgap: it lets parallelization policies shard data at
sub-page granularity, where a real hardware MMU's one-PA-per-entry
assumption would silently mis-route accesses. A real MMU does not work
this way; the model documents this as a simplification.
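
The region semantics can be sketched with a list of mappings scanned newest-first, so a later mapping wins wherever ranges overlap. The class name, method names, and the choice to return `None` on a miss are assumptions for illustration; the real page-table structure is not shown in this document.

```python
from typing import Optional


class RegionPageTable:
    """Toy page table: arbitrary (va, size, pa) regions, later write wins."""

    def __init__(self) -> None:
        self._regions: list[tuple[int, int, int]] = []  # (va_start, size, pa_start)

    def map(self, va: int, size: int, pa: int) -> None:
        self._regions.append((va, size, pa))

    def translate(self, va: int) -> Optional[int]:
        # Scan newest-first so the most recent mapping takes precedence.
        for start, size, pa in reversed(self._regions):
            if start <= va < start + size:
                return pa + (va - start)
        return None  # no mapping; the caller decides how to fall back
```

Two 256-byte regions inside the same 4 KiB page can point at unrelated physical ranges, which a one-PA-per-entry table could not express.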
### pe_scheduler
<!-- src: ADR-0014 Decision -->
The scheduler is the sole dispatcher inside a PE. Simple commands are
routed directly to the right engine. Composite commands generate a
tile plan, and the resulting tile tokens are fed into the pipeline.
Self-routing keeps the scheduler off the per-stage hot path: each
engine, on finishing a stage, advances the token to the next stage's
component itself, so the scheduler only does initial dispatch and
completion tracking.
### pe_tcm
<!-- src: ADR-0040 Context, Decision -->
The TCM is the per-PE tightly-coupled scratchpad memory. It models
time only, not data — the actual payload lives in the simulator's
memory store. Read and write are independent channels: each is
modeled as a one-in-flight resource, so same-direction requests
serialize but a read and a write can overlap. The bandwidth of each
direction is configured separately and applied as bytes-over-bandwidth
on each request.
<!-- src: ADR-0040 Decision, Alternatives Considered -->
The decision to keep read and write on separate channels was made
because the PE pipeline's normal case overlaps fetch (read) and store
(write). Collapsing them into a single shared channel would have
artificially serialized that overlap and produced an incorrect
bandwidth ceiling.
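
The dual-channel behavior can be sketched by tracking when each direction next becomes free. The class below is a time-only toy with assumed names and made-up bandwidths; it stands in for the one-in-flight resources and is not the component's implementation.

```python
class DualChannelTcm:
    """Time-only TCM model: one in-flight request per direction."""

    def __init__(self, read_bw_bytes_per_ns: float, write_bw_bytes_per_ns: float) -> None:
        self._bw = {"read": read_bw_bytes_per_ns, "write": write_bw_bytes_per_ns}
        self._free_at = {"read": 0.0, "write": 0.0}

    def access(self, kind: str, size_bytes: int, arrival_ns: float) -> float:
        """Return the completion time of a read or write arriving at arrival_ns."""
        start = max(arrival_ns, self._free_at[kind])  # wait for the same direction only
        done = start + size_bytes / self._bw[kind]
        self._free_at[kind] = done
        return done
```

Two reads arriving together serialize, while a write arriving at the same moment proceeds in parallel on its own channel.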
### sram
<!-- src: ADR-0041 Context, Decision -->
The cube SRAM is a per-cube scratchpad attached to one of the cube's
routers. As a node it applies a configured access overhead, pays the
link-bandwidth drain stamped on the incoming request, and sends a
response on the reverse path. It is a terminal — it does not forward.
<!-- src: ADR-0041 Decision, Consequences -->
A second role is as one of three backing-memory tiers (TCM, SRAM, HBM)
that an inter-PE collective slot can live in. When the slot lives in
SRAM, the PE_DMA pays the slot read or write latency directly using
the configured SRAM bandwidth and overhead; the SRAM component does
not need to know about collective semantics. This separation keeps
the SRAM component agnostic to the collective subsystem.
### tiling
<!-- src: ADR-0042 Context, Decision -->
The tile-plan generator is not a runtime component — it is a pure
module of functions that take a problem shape (matrix dimensions, tile
sizes) and produce an ordered list of tile-stage sequences. The
scheduler consumes this list. Each tile's stage sequence depends on
how its operands are staged: operands streamed from HBM produce
DMA_READ stages, operands already resident in TCM (because they were
loaded eagerly upfront) skip them.
<!-- src: ADR-0042 Decision, Consequences -->
The plan generator is intentionally pure — given the same input it
returns the same plan, with no simulator events created. This lets
the rest of the system reason about tile sequences as data, and it
makes the plan testable in isolation without simulator state. New
plan variants (for example, K-major or DTensor-aware plans) can be
added as new functions following the same shape.
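
A minimal sketch of a pure plan generator, reusing the stage names from the PE pipeline description (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE). The function signature is hypothetical and the K dimension is omitted for brevity, so this is a simplified stand-in for the real plan builders.

```python
from itertools import product
from math import ceil


def gemm_tile_plan(
    m: int,
    n: int,
    tile_m: int,
    tile_n: int,
    operands_in_tcm: bool = False,
) -> list[tuple[tuple[int, int], list[str]]]:
    """Return an ordered list of (tile_index, stage_sequence); no side effects."""
    core = ["FETCH", "GEMM", "STORE", "DMA_WRITE"]
    # Operands already resident in TCM skip the DMA_READ stage.
    stages = core if operands_in_tcm else ["DMA_READ", *core]
    rows, cols = ceil(m / tile_m), ceil(n / tile_n)
    return [((i, j), list(stages)) for i, j in product(range(rows), range(cols))]
```

Because the function only computes data, a test can compare two plans for equality without constructing any simulator state.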
---
## Implementation Decisions
This section collects cross-cutting decisions — algorithms, policies,
schemes, and contracts — that span multiple components rather than
living inside one.
### Address Scheme
<!-- src: ADR-0001 Context, Decision -->
Every physical address in the simulator decodes into a structured
location. A fixed-width physical address carries the SIP id, the
cube id within the SIP, a type discriminator (HBM vs PE-resource vs
others), and a type-specific offset. HBM addresses additionally encode
the per-PE slice offset so the controller can determine which PE
owns the target slice without external lookup. The layout is
deliberately reserved rather than packed-to-fit, so new sub-units can
be added at the type-discriminator level without rewriting existing
addresses.
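
Decoding a structured address amounts to shifting and masking fixed-width fields. In the sketch below the field widths are hypothetical values that merely sum to 51 bits; the real layout is defined by ADR-0001 and is not reproduced here, and the per-PE HBM slice field is left out.

```python
from typing import NamedTuple

# Hypothetical widths (4 + 4 + 3 + 40 = 51 bits), chosen for illustration only.
SIP_BITS, CUBE_BITS, TYPE_BITS, OFFSET_BITS = 4, 4, 3, 40


class Decoded(NamedTuple):
    sip: int
    cube: int
    kind: int
    offset: int


def encode(sip: int, cube: int, kind: int, offset: int) -> int:
    """Pack the fields most-significant first: sip | cube | kind | offset."""
    fields = ((sip, SIP_BITS), (cube, CUBE_BITS), (kind, TYPE_BITS), (offset, OFFSET_BITS))
    pa = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"field value {value} does not fit in {width} bits")
        pa = (pa << width) | value
    return pa


def decode(pa: int) -> Decoded:
    """Unpack the fields least-significant first."""
    offset = pa & ((1 << OFFSET_BITS) - 1)
    pa >>= OFFSET_BITS
    kind = pa & ((1 << TYPE_BITS) - 1)
    pa >>= TYPE_BITS
    cube = pa & ((1 << CUBE_BITS) - 1)
    pa >>= CUBE_BITS
    sip = pa & ((1 << SIP_BITS) - 1)
    return Decoded(sip, cube, kind, offset)
```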
<!-- src: ADR-0011 Context, Decision -->
On top of physical addressing, the simulator supports three address
models that the runtime API selects between. Direct physical
addressing is retained as a fallback. Virtual addressing — the
current default — gives each tensor a contiguous virtual range at
deployment, with the per-PE MMU translating per access; an
alternative logical-address scheme remains a future option. The
virtual-address path is what every current test path takes; the PA
fallback is used by the MMU itself when no mapping exists for an
address (a deliberate signal, not an error).
<!-- src: ADR-0011 Decision, Consequences -->
Tensor placement is represented as a list of physical-address shards,
each tagged with target SIP, cube, and PE, plus a single tensor-wide
virtual base. This means a kernel sees one virtual base for the whole
tensor while the host driver and the engine still know exactly where
each shard lives. Replicated tensors get per-cube local PA mappings;
sharded tensors broadcast their mapping across cubes within a SIP.
### Routing, Distance & Helper API
<!-- src: ADR-0002 Context, Decision -->
Routing is policy-driven, deterministic, and topology-aware. Given a
source, a destination, and an intent — for example, PE-initiated
DMA versus host-initiated memory write versus a generic
component-to-component query — the routing layer picks the right
path. The intent matters because different traffic types must avoid
different categories of edges: PE-initiated DMA should not traverse
command-only links; M_CPU DMA should not pass through PE-internal
pipeline edges; cube-local transfers should not use the
zero-distance UCIe bus that would otherwise look attractive to a
shortest-path search.
<!-- src: ADR-0051 Decision -->
The routing layer therefore maintains four separate adjacency graphs
at construction, each excluding a different category of edges, and
picks the appropriate one per intent. On top of the graphs sits a
helper API that hides the topology's naming convention: callers ask
for the PCIe endpoint of a given SIP, the M_CPU of a given cube, or
the HBM destination for a given physical address, and receive the
corresponding node id. No component constructs node-id strings
directly; if the naming convention ever changes, the change is local
to the helper layer.
<!-- src: ADR-0051 Decision, Consequences -->
Path-finding itself uses Dijkstra with explicit per-edge weights
(routing weight is allowed to differ from physical distance — for
example, UCIe is configured to be routing-preferable). Tie-breaks
follow insertion order, which keeps results deterministic. Paths
between unreachable nodes raise rather than returning empty, surfacing
topology errors immediately.
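The path-finding behavior described above can be sketched as follows. This is an illustrative implementation, not the routing layer's code: the function name and the adjacency format are assumptions, and `adj` stands in for whichever per-intent graph was selected.

```python
import heapq
from itertools import count


def shortest_path(adj, src, dst):
    """Dijkstra over adj: dict of node -> list of (neighbor, weight).

    Ties resolve by insertion order: relaxation uses a strict '<', so the
    first equal-cost route found is kept, and heap entries carry a sequence
    number so equal-cost nodes pop in the order they were pushed.
    """
    seq = count()
    best = {src: 0.0}
    prev = {}
    done = set()
    heap = [(0.0, next(seq), src)]
    while heap:
        cost, _, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == dst:
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return path[::-1]
        for nbr, weight in adj.get(node, ()):
            new_cost = cost + weight
            if new_cost < best.get(nbr, float("inf")):
                best[nbr] = new_cost
                prev[nbr] = node
                heapq.heappush(heap, (new_cost, next(seq), nbr))
    # Unreachable destinations raise instead of returning an empty path.
    raise LookupError(f"no route from {src!r} to {dst!r}")
```

With two equal-cost routes, the one whose first edge was inserted earlier wins, which is what keeps results reproducible across runs.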
### Memory Semantics and Local-HBM Bandwidth
<!-- src: ADR-0004 Context, Decision -->
A PE accessing its own HBM slice through its own cube's NOC must see
the full local HBM bandwidth — that is the model's intent. Memory
traffic accumulates latency from per-component overhead and
bytes-over-link-bandwidth serialization along the path, but the
controller does not throttle below the slice's allotted bandwidth.
Cross-PE-slice accesses inside the same cube, cross-cube accesses
through UCIe, and cross-SIP accesses through PCIe each pay
progressively more overhead as the path grows.
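As a worked illustration of how latency accumulates, the sketch below sums a per-component overhead and a bytes-over-bandwidth term for every hop. The function name and the numbers are made up, and charging serialization on every link is an assumption about the model rather than a statement of what the engine does.

```python
def path_latency_ns(nbytes, hops):
    """hops: iterable of (node_overhead_ns, link_bandwidth_bytes_per_ns)."""
    total = 0.0
    for overhead_ns, bandwidth in hops:
        # Fixed component overhead plus serialization time on the link.
        total += overhead_ns + nbytes / bandwidth
    return total


# Illustrative numbers only: a short local path and a longer remote path.
local = path_latency_ns(1024, [(10.0, 64.0)])
remote = path_latency_ns(1024, [(10.0, 64.0), (5.0, 32.0)])
```

Adding hops can only add terms to the sum, which matches the statement that longer paths pay progressively more overhead.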
### Topology Compilation, Diagrams & Builder Algorithms
<!-- src: ADR-0006 Context, Decision -->
Topology is configurable, not hardcoded. The simulator reads a YAML
spec, compiles it into a flat graph of nodes and edges plus four
view projections at different abstraction levels — system, SIP, cube,
PE — and uses the compiled graph as the single source for both
execution and visualization. Distance metadata used by routing is
extracted at compile time so that diagrams and routing decisions
agree by construction.
<!-- src: ADR-0005 Context, Decision -->
Diagrams are derived artifacts of the compiled topology. The visualizer
produces one SVG per view at the appropriate abstraction level; nothing
in the diagrams is hand-drawn or hand-positioned. Distance-aware
layout rules place nodes in the diagrams using the same coordinates
that routing uses to compute distance, so a diagram that "looks
wrong" is a signal that the topology itself has a problem, not the
visualizer.
<!-- src: ADR-0053 Decision -->
Inside a cube the router mesh is generated automatically. PE corner
positions are fixed by convention; the relay-column algorithm
inserts additional grid columns whenever the gap between adjacent PE
columns would exceed a tunable maximum. HBM occupies a central
exclusion zone — router slots inside the zone are deliberately empty,
since HBM controllers attach as separate named nodes. M_CPU and SRAM
attach to the nearest router by Euclidean distance from their
configured placement coordinates, and UCIe physical lanes distribute
along the boundary rows and columns. The whole mesh is cached
beside the topology spec and invalidated only when one of a small set
of layout-relevant fields changes.
<!-- DIAGRAM: One cube's router mesh — rows × cols of routers with HBM exclusion zone in the middle, PEs/M_CPU/SRAM attaching to nearest routers, UCIe PHYs along the perimeter. -->
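The relay-column rule can be sketched in a few lines. This is a guess at the shape of the algorithm, not the builder's code: the function name is hypothetical and spacing the relay columns evenly inside a gap is an assumption.

```python
import math


def with_relay_columns(pe_cols, max_gap):
    """Insert relay columns wherever adjacent PE columns are too far apart."""
    cols = sorted(pe_cols)
    if not cols:
        return []
    out = [cols[0]]
    for left, right in zip(cols, cols[1:]):
        gap = right - left
        # Number of segments needed so that no segment exceeds max_gap.
        segments = max(1, math.ceil(gap / max_gap))
        for i in range(1, segments):
            out.append(left + gap * i / segments)  # evenly spaced relay
        out.append(right)
    return out
```

A gap exactly equal to `max_gap` inserts nothing, consistent with relays appearing only when the gap would exceed the maximum.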
### Tensor Deployment and Allocation
<!-- src: ADR-0008 Context, Decision -->
Tensor deployment in the runtime API produces a list of physical-address
shards plus a single tensor-wide virtual base. The host allocator
walks the data-parallelism policy, computes per-shard placement, and
emits the per-shard physical addresses through the per-PE allocators.
No separate "allocate then later attach to a device" RPC exists —
allocation and deployment are a single operation that produces a
deployed tensor handle.
### Memory Allocator Algorithms
<!-- src: ADR-0048 Context, Decision -->
Each per-PE allocator owns two channels — HBM slice and TCM — each
backed by an offset-keyed free-list. Allocation is first-fit; freeing
coalesces with adjacent free blocks. A device-wide virtual allocator
sits above the per-PE allocators, aligns requests up to the configured
page size, and coalesces on free in the same way. The trade-off is
explicit: first-fit is simpler and cheaper than best-fit or buddy
allocation, and the simulator's workload is stack-like enough
(deploy / kernel / free in matched order) that fragmentation is not
a practical concern.
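A minimal sketch of an offset-keyed first-fit free-list with coalescing on free. The class and method names are hypothetical and page alignment is omitted; the point is only to show the two operations described above.

```python
class FirstFitAllocator:
    """Offset-keyed free-list: first-fit allocation, coalescing free."""

    def __init__(self, size):
        self._free = {0: size}  # offset -> length of each free block

    def alloc(self, nbytes):
        for offset in sorted(self._free):
            length = self._free[offset]
            if length >= nbytes:
                del self._free[offset]
                if length > nbytes:
                    # Keep the unused tail of the block on the free-list.
                    self._free[offset + nbytes] = length - nbytes
                return offset
        # Fail loudly instead of handing back a partial allocation.
        raise MemoryError(f"cannot allocate {nbytes} bytes")

    def free(self, offset, nbytes):
        # The caller is trusted to pass back exactly what was allocated.
        following = offset + nbytes
        if following in self._free:
            nbytes += self._free.pop(following)  # merge with next block
        for prev_offset, prev_len in self._free.items():
            if prev_offset + prev_len == offset:
                self._free[prev_offset] = prev_len + nbytes  # merge with previous
                return
        self._free[offset] = nbytes
```

When blocks are freed in the reverse of allocation order, each free merges with its neighbor and the list collapses back to a single block.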
<!-- src: ADR-0048 Decision, Consequences -->
Allocation failure raises rather than silently returning a partial
result. A partial tensor reaching the engine would route over wrong
PAs and silently corrupt simulator output, so an out-of-memory signal
is preferred. The free path trusts its caller to pass back exactly
what was allocated; the small risk of caller error in exchange for
fast common-case freeing is documented as a deliberate trade.
### Kernel Execution and Host-Device Messaging
<!-- src: ADR-0009 Context, Decision -->
Kernel execution decomposes into a small set of messages that travel
the device graph. The host issues a single kernel-launch message; the
IO_CPU fans it out per-cube; the cube M_CPU fans it out per-PE; the
PE CPU resolves the kernel and runs it through the scheduler.
Completion flows back the same way, gated by per-shard completion
tracking. Memory operations follow the same pattern: a memory write
or read travels as one message that the engine routes to the right
HBM controller, with a response taking the reverse path.
<!-- src: ADR-0012 Context, Decision -->
The schema between the host and the device-side IO_CPU is PA-first
and shard-tagged. Every byte of host-issued payload arrives with an
explicit target SIP, cube, PE, and physical address. The IO_CPU does
not decode addresses to derive placement — placement is named
explicitly by the shard list. This makes the host-device interface
deterministic and keeps the routing helper free of host-derived
intent.
### CLI Surface and Semantics
<!-- src: ADR-0010 Context, Decision -->
The command-line interface exposes four subcommands. A bench runner
loads a topology, resolves a registered benchmark by name or index,
and runs it on a selected device. A bench-listing command enumerates
the registered benchmarks. A probe utility runs a fixed catalog of
traffic patterns through the engine for latency and bandwidth
verification. A web viewer renders the topology in a browser. A
benchmark instance is always single-device by convention; multi-SIP
collective work happens inside the benchmark through the launcher
abstraction, not by multiplexing the CLI.
### Component Port and Wire Fabric Model
<!-- src: ADR-0015 Context, Decision -->
Every modeled component exposes input and output ports, and every
edge in the topology connects an output port on one component to an
input port on another. Bandwidth and propagation delay are properties
of the wire between ports, not of the component endpoints. A
component's responsibility is to apply its configured per-node
overhead and either forward to the next hop or terminate; the wire
charges the byte-over-bandwidth serialization separately.
<!-- src: ADR-0015 Decision, Consequences -->
This separation lets components be swapped behind their port
interface without changing the rest of the model, and it keeps
bandwidth contention at the wire level where multiple components may
contend for the same edge. Future component models can refine
internal behavior without disturbing the fabric.
### Two-Pass Data Execution
<!-- src: ADR-0020 Context, Decision -->
The simulator runs in two passes. The first pass (fast and always
on) runs the discrete-event engine and records every data operation
in an operation log with timestamps, component identifiers, and
per-operation parameters. The second pass (optional, opt-in) replays
the log against an in-memory tensor store to produce actual numerical
results. Tests that only need timing skip the second pass; tests that
need to verify correctness opt in.
<!-- src: ADR-0020 Decision, Consequences -->
The split lets the timing engine remain unconcerned with data
semantics: kernels move handles around, not bytes. The replay phase
recovers data semantics from the recorded operations, in their
original time order with a small set of secondary-sort rules. The
op-log records carry enough metadata — input snapshots for compute
operations, source snapshots for cross-component copies — that the
replay phase cannot mis-order with respect to in-flight mutations.
### Sim-engine Op Log and Memory Store Schemas
<!-- src: ADR-0052 Context, Decision -->
The operation log holds typed records with seven fields each: start
and end timestamps, the component that issued the operation, an
operation kind ("memory", "gemm", "math"), an operation name, a
parameter dictionary, and a (currently unused) dependency list.
Records are kept in stable timestamp order. The parameter dictionary
varies by operation: a DMA read carries source address and byte count;
a GEMM carries operand shapes, dtypes, and address spaces; a math
operation carries input addresses and snapshots.
<!-- src: ADR-0052 Decision, Consequences -->
The companion memory store is a two-level dictionary keyed by
address space ("hbm", "tcm", "sram", others) and integer address.
Reads and writes are reference-based — no copy by default — so
callers wanting to detach a snapshot must copy explicitly. This is
deliberate: the engine-internal snapshot paths copy at well-defined
points (math input capture, HBM source capture for DMA writes,
inbound collective copies) and downstream replay code therefore
sees stable data even when slot or scratch addresses are reused by
later operations.
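The two schemas can be sketched as below. The field names `t_start`, `t_end`, `component_id`, and `params` match how the GEMM sweep script reads op-log records; `kind`, `name`, `deps`, and the whole `MemoryStore` class are hypothetical names chosen for illustration.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class OpRecord:
    """One operation-log record with the seven fields described above."""
    t_start: float
    t_end: float
    component_id: str
    kind: str            # "memory", "gemm", or "math"
    name: str
    params: dict = field(default_factory=dict)
    deps: list = field(default_factory=list)  # currently unused


class MemoryStore:
    """Two-level dictionary: address space -> integer address -> value."""

    def __init__(self):
        self._spaces = {}

    def write(self, space, addr, value):
        # Stores a reference; no copy is made.
        self._spaces.setdefault(space, {})[addr] = value

    def read(self, space, addr):
        # Returns the stored object itself, not a copy.
        return self._spaces[space][addr]


store = MemoryStore()
buf = [1, 2, 3]
store.write("tcm", 0x40, buf)
snapshot = copy.deepcopy(store.read("tcm", 0x40))  # explicit detach
buf[0] = 99  # visible through the store, not through the snapshot
```

Python's `sorted` is stable, so sorting records by `t_start` preserves the recorded order among records that share a timestamp.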
### 2D Grid Program Identity
<!-- src: ADR-0022 Context, Decision -->
Inside a kernel the program identity is two-dimensional. The
first axis corresponds to the PE index within a cube; the second
corresponds to the cube index within a SIP. Together they let a
kernel address its position both within its cube and within the
larger system without needing to know the full topology. Total
program counts along each axis are exposed symmetrically.
### Parallelism — SIP Launcher, DPPolicy, Megatron-TP, AHBM Backend, and CCL Algorithm Module
<!-- src: ADR-0024 Context, Decision -->
The launcher model treats each SIP as one rank. Inside a process the
launcher spawns one greenlet per SIP rank; the rank is bound to its
greenlet so that any code running in that worker sees the right
distributed-style rank. This is a deliberately PyTorch-compatible
shape: a benchmark looks like a small DDP training script — initialize
a process group, spawn workers, each worker runs the same body.
<!-- src: ADR-0026 Context, Decision -->
Data-parallelism policy lives in a single object that names the
sharding strategy along the cube axis (replicate, row-wise,
column-wise) and along the PE axis (same set of values), and optionally
overrides the number of cubes or PEs participating. The policy is
intra-device — it does not cross SIP boundaries. SIP-level parallelism
is the launcher's responsibility, and the two axes compose
orthogonally.
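A rough sketch of how such a policy object might drive per-PE shapes. The `DPPolicy` name and the `num_cubes` override appear in the commit log; the field names `cube`, `pe`, and `num_pes`, the string values, and the `local_shape` helper are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DPPolicy:
    cube: str = "replicate"       # "replicate", "row", or "column"
    pe: str = "replicate"         # same set of values
    num_cubes: int | None = None  # optional override of the topology count
    num_pes: int | None = None


def local_shape(shape, policy, topo_cubes, topo_pes):
    """Shape of the shard one PE holds for a 2D tensor."""
    rows, cols = shape
    axes = (
        (policy.cube, policy.num_cubes, topo_cubes),
        (policy.pe, policy.num_pes, topo_pes),
    )
    for mode, override, topo_count in axes:
        # The policy override takes precedence over the topology count.
        n = override if override is not None else topo_count
        if mode == "row":
            if rows % n:
                raise ValueError(f"{rows} rows do not split across {n}")
            rows //= n
        elif mode == "column":
            if cols % n:
                raise ValueError(f"{cols} columns do not split across {n}")
            cols //= n
    return rows, cols
```

Every consumer of the policy has to apply the same override precedence, otherwise allocation and launch disagree on shard shapes.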
<!-- src: ADR-0027 Context, Decision -->
A Megatron-style tensor-parallel API sits on top of the launcher and
the DP policy. Layer-level building blocks — column-parallel linear,
row-parallel linear, all-reduce — name their sharding intent in terms
the launcher and the placement policy can compose. This is the layer
that bench code typically writes against.
<!-- src: ADR-0047 Context, Decision -->
For collective operations the runtime exposes a PyTorch-compatible
distributed backend named "ahbm". On process-group initialization the
backend loads the configured collective-algorithm module, resolves
the world size (priority: explicit ccl.yaml override → defaults
section → topology SIP count), imports the algorithm module
dynamically, derives the SIP topology kind, and pushes the inter-PE
neighbor table to every participating PE. From that point on, an
all-reduce call dispatches the algorithm's kernel function across
all ranks.
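The world-size priority order can be written out as a short function. The dictionary keys and the function name are hypothetical stand-ins for whatever ccl.yaml actually uses; only the precedence is taken from the text.

```python
def resolve_world_size(ccl_cfg, topology_sip_count):
    """Resolve world size: explicit override, then defaults, then topology."""
    override = ccl_cfg.get("world_size")
    if override is not None:
        return int(override)
    default = ccl_cfg.get("defaults", {}).get("world_size")
    if default is not None:
        return int(default)
    # Fall back to the number of SIPs in the loaded topology.
    return topology_sip_count
```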
<!-- src: ADR-0050 Context, Decision -->
A collective-algorithm module is a Python module with a small, fixed
contract. It exposes topology-kind integer constants, a name-to-kind
mapping for the YAML configuration, a kernel-arguments builder, and
a kernel function — the kernel function being aliased to the name
`kernel` so the backend can find it generically. The kernel itself
takes the tensor pointer, the per-cube element count, cube mesh
width and height, the world size, the current rank, and the SIP
topology dimensions; the backend appends those last four arguments
automatically. New collectives slot in by adding a new module that
follows this shape.
<!-- src: ADR-0027 Decision, Consequences -->
The combination is deliberate: bench authors get to write code that
looks like a regular distributed training script, while the launcher,
backend, and placement policies behind it remain free to redirect
work to the right SIP, cube, and PE without exposing topology to the
kernel.
### IPCQ Direction Addressing
<!-- src: ADR-0025 Context, Decision -->
Inside a collective algorithm, peer PEs are named by direction —
"N", "S", "E", "W" for cube-internal neighbors, and "global_*" for
cross-SIP neighbors. Direction addressing is the addressing scheme:
the algorithm names a direction, the IPCQ neighbor table installed
at process-group time resolves the direction to the peer endpoint's
physical-address coordinates, and the PE_DMA performs the actual
transfer. The algorithm itself does not see PA arithmetic — direction
is the user-facing handle.
### Intercube All-Reduce
<!-- src: ADR-0032 Context, Decision -->
The default all-reduce algorithm uses a center-rooted bidirectional
reduce phase inside each SIP's cube mesh, followed by an inter-SIP exchange
on the mesh's root cube, and then a bidirectional broadcast back
out. Center-rooting halves the in-cube hop count compared with a
corner-rooted walk. The inter-SIP exchange itself follows the
configured SIP topology — ring, torus, or non-wrapping mesh —
selected at runtime through the SIP-topology kind integer the
backend passes to the kernel.
### Evaluation Harnesses
<!-- src: ADR-0043 Context, Decision -->
The all-reduce evaluation harness drives correctness and the
latency/buffer-kind sweeps through the public distributed path —
initialize process group, spawn workers, call all-reduce — rather
than the lower-level engine interface. A shared helper module factors
out the setup; sweep tests cover the buffer-kind tiers (TCM, SRAM,
HBM) and the inter-SIP topology variants. The plots produced by the
harness are part of its output contract; the harness regenerates them
on demand.
<!-- src: ADR-0044 Context, Decision -->
The GEMM evaluation harness is split into two layers. A heavy
shape-and-variant sweep lives as a manual script — it runs the same
composite-GEMM benchmark across many shapes and operand-staging
variants, harvests the resulting op-log, and writes a JSON summary.
A faster figure-generation layer lives in the test suite and consumes
that JSON to render plots. The split keeps the heavy data
generation explicit and out of the regular test path.
### Bench Module Contract
<!-- src: ADR-0045 Context, Decision -->
Adding a new benchmark requires only dropping a file into the
benchmarks directory. The file registers one or more benchmark
functions through a small decorator that takes a kebab-case name and
a human-readable description. The decorator is the registration
mechanism — there is no separate manifest. Each benchmark function
takes one argument, conventionally named `torch`, which is the
runtime context exposing tensor allocation, kernel launch,
distributed APIs, and process-spawning. The function name is `run` by
convention.
<!-- src: ADR-0045 Decision, Consequences -->
A benchmark must submit at least one operation, or the runner
returns an error. A benchmark instance is single-device by default;
when a benchmark is collective, it uses the distributed-process-spawn
pattern internally — one worker greenlet per rank, with each worker
binding to its rank. Multi-device benchmark patterns outside that
shape are not supported.
### Kernel-side `tl.*` API
<!-- src: ADR-0046 Context, Decision -->
Inside a kernel function, the `tl` argument exposes the kernel-side
API in a shape that mirrors the conventions of established
GPU-kernel languages. Categories: reference handles that name HBM
data without issuing DMA; data movement (load, store) that does
issue DMA; GEMM and math compute (dot, composite, the unary and
binary math operations, reductions); index and scalar helpers
(program identity, range-builders); metadata-only operations like
transpose; and the collective primitives (send, receive,
non-blocking receive). Tensor handles support arithmetic operators
via a thread-local active context so kernel code reads naturally.
<!-- src: ADR-0046 Decision, Consequences -->
The API supports two execution modes. A command-list mode records
operations into a list without consuming simulator time — useful for
inspection and lightweight tests. A greenlet-driven mode runs the
kernel as a child greenlet that switches back to the simulator on
each `tl.*` call; the simulator drives the event scheduler and hands
real data back to the kernel as DMA reads complete. The two modes
share the same surface; the kernel does not know which one it is
running under.
### Probe Subcommand
<!-- src: ADR-0049 Context, Decision -->
The probe utility runs three families of traffic patterns through
the engine — host-to-device writes at increasing hop counts,
device-to-host reads at increasing hop counts, and PE-initiated DMA
across the cube mesh — and reports actual latency, the analytical
formula breakdown, effective bandwidth, bottleneck bandwidth, and
utilization. A fixed reference size is used for the summary table;
a separate utilization-versus-size sweep covers a logarithmic range
of transfer sizes. Each case runs in its own engine instance so
cases do not perturb each other.
<!-- src: ADR-0049 Decision, Consequences -->
The probe also checks a small set of invariants automatically:
monotonic latency increase with hop count, device-to-host latency
at least as large as host-to-device for the same hop count, and a
faster best-case path than worst-case for cross-cube PE DMA. Failures
print prominently. The output is meant for human reading; automated
parsing should not depend on column widths or whitespace.
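The first two invariants can be expressed as a small check over measured latencies. The function name and the input format are assumptions, and treating "monotonic" as strictly increasing is a choice made for this sketch.

```python
def check_invariants(h2d, d2h):
    """h2d, d2h: dicts mapping hop count to measured latency in ns."""
    failures = []
    hops = sorted(h2d)
    # Latency should grow with hop count.
    for near, far in zip(hops, hops[1:]):
        if h2d[far] <= h2d[near]:
            failures.append(f"h2d latency not increasing: {near} -> {far} hops")
    # Device-to-host should be at least as slow as host-to-device.
    for h in hops:
        if h in d2h and d2h[h] < h2d[h]:
            failures.append(f"d2h faster than h2d at {h} hops")
    return failures
```

Returning a list of messages rather than raising lets the caller print every failure prominently in a single run.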
---
This document summarizes 46 architecture decisions captured during
the first half of 2026. It is regenerated mechanically from the
decision corpus; sources are recorded in HTML comments throughout.
+8 -225
@@ -1,237 +1,20 @@
 """Sweep GEMM shapes through kernbench and dump PE_accelerator engine times.

-For each shape:
-  - run benches.matmul_composite via the same run_bench path the CLI uses
-  - read result.engine.op_log
-  - filter to per-PE engines: pe_dma, pe_fetch_store, pe_gemm, pe_math
-  - record sum-of-durations (engine occupancy) AND wall-clock active interval
+Thin wrapper: the sweep logic now lives in
+``kernbench.benches.milestone_1h_gemm`` (the single home, ADR-0054, also the
+``milestone-1h-gemm`` bench). This script remains the manual entry point for
+regenerating ``docs/diagrams/gemm_sweep.json`` on demand and honors the same
+``SWEEP_SHAPES`` / ``SWEEP_TOPOLOGY`` env overrides.

-Output: docs/diagrams/gemm_sweep.json
+    python scripts/gemm_sweep.py
 """
 from __future__ import annotations

-import json
-import os
-import sys
-import time
-from pathlib import Path
-
-# Default sweep covering under-tile, single-tile, multi-tile, and asymmetric regimes.
-# Each entry is either a single integer (square M=K=N=S) or "MxKxN".
-# Override via env: SWEEP_SHAPES="16,32,16x2048x16,..."
-DEFAULT_SHAPES = [
-    "32x32x32",     # 1 tile, K=32 < TILE_K=64 → under-tile in K
-    "32x64x32",     # 1 tile, exact single-tile fit
-    "32x128x32",    # 2 tiles, aligned
-    "32x128x128",   # 8 tiles, aligned
-    "32x3072x32",   # 48 tiles, all K-axis (tall-skinny)
-    "8x128x128",    # 8 tiles, but M=8 < TILE_M=32 → MAC array under-fed
-    "128x8x128",    # 16 tiles, but K=8 < TILE_K=64 → MAC array under-fed
-    "512",          # 2048 tiles, fully aligned — "well-pipelined" reference
-]
-
-# Operand-staging variants exercised per shape.
-VARIANTS = ["ref_ref", "load_ref", "load_load"]
-
-# Engines whose timings we collect (component_id suffix match).
-ENGINES = ["pe_dma", "pe_fetch_store", "pe_gemm", "pe_math"]
-
-# Per-stage breakdown labels (StageType enum names from pe_types.py).
-STAGES = ["DMA_READ", "DMA_WRITE", "FETCH", "STORE", "GEMM", "MATH"]
-
-# Scheduler tile sizes (mirror of PeSchedulerComponent.TILE_M/K/N).
-TILE_M, TILE_K, TILE_N = 32, 64, 32
-
-OUT_PATH = Path(__file__).parent.parent / "docs" / "diagrams" / "gemm_sweep.json"
-
-
-def _engine_wall_ns(records, suffix: str) -> float:
-    """Wall-clock interval the engine was active (union of overlapping ops)."""
-    intervals = [(r.t_start, r.t_end) for r in records
-                 if r.component_id.endswith("." + suffix)]
-    if not intervals:
-        return 0.0
-    intervals.sort()
-    merged_end = intervals[0][1]
-    merged_start = intervals[0][0]
-    total = 0.0
-    for s, e in intervals[1:]:
-        if s <= merged_end:
-            merged_end = max(merged_end, e)
-        else:
-            total += merged_end - merged_start
-            merged_start, merged_end = s, e
-    total += merged_end - merged_start
-    return total
-
-
-def _engine_occupancy_ns(records, suffix: str) -> float:
-    return sum(r.t_end - r.t_start for r in records
-               if r.component_id.endswith("." + suffix))
-
-
-def _engine_count(records, suffix: str) -> int:
-    return sum(1 for r in records if r.component_id.endswith("." + suffix))
-
-
-def _stage_occupancy_ns(records, stage_type: str) -> float:
-    """Sum t_end - t_start over op_log records whose params.stage_type matches.
-
-    Requires op_log records produced post the TileToken stage_type capture
-    (sim_engine/op_log.py).
-    """
-    return sum(
-        r.t_end - r.t_start
-        for r in records
-        if r.params.get("stage_type") == stage_type
-    )
-
-
-def _stage_wall_ns(records, stage_type: str) -> float:
-    """Interval-union wall-clock for records whose stage_type matches."""
-    intervals = sorted(
-        (r.t_start, r.t_end) for r in records
-        if r.params.get("stage_type") == stage_type
-    )
-    if not intervals:
-        return 0.0
-    total = 0.0
-    cs, ce = intervals[0]
-    for s, e in intervals[1:]:
-        if s <= ce:
-            ce = max(ce, e)
-        else:
-            total += ce - cs
-            cs, ce = s, e
-    total += ce - cs
-    return total
-
-
-def _stage_count(records, stage_type: str) -> int:
-    return sum(1 for r in records if r.params.get("stage_type") == stage_type)
-
-
-def _run_one(M: int, K: int, N: int, topology: str, variant: str = "ref_ref") -> dict:
-    os.environ["MATMUL_M"] = str(M)
-    os.environ["MATMUL_K"] = str(K)
-    os.environ["MATMUL_N"] = str(N)
-    os.environ["MATMUL_VARIANT"] = variant
-    # Late imports so env vars are read by matmul_composite at module load.
-    # Force re-import to pick up new env values.
-    for mod_name in [m for m in list(sys.modules) if m.startswith("kernbench.benches.matmul_composite")]:
-        del sys.modules[mod_name]
-    from kernbench.benches.registry import resolve as resolve_bench
-    from kernbench.runtime_api.bench_runner import run_bench
-    from kernbench.runtime_api.types import resolve_device
-    from kernbench.sim_engine.engine import GraphEngine
-    from kernbench.topology.builder import resolve_topology
-
-    topo = resolve_topology(topology)
-    bench = resolve_bench("matmul-composite").run
-    device = resolve_device(None)
-    t0 = time.time()
-    result = run_bench(
-        topology=topo, bench_fn=bench, device=device,
-        engine_factory=lambda t, d: GraphEngine(
-            getattr(t, "topology_obj", t), enable_data=True,
-        ),
-    )
-    wall = time.time() - t0
-    op_log = result.engine.op_log
-    if not result.completion.ok:
-        raise RuntimeError(f"bench failed at M={M},K={K},N={N}: {result.completion}")
-    # Bytes touched at f16 (2 B): full A + full B + full out (each operand
-    # streamed once through HBM by the composite plan).
-    bytes_total = (M * K + K * N + M * N) * 2
-    row = {
-        "M": M, "K": K, "N": N,
-        "variant": variant,
-        "flops": 2 * M * K * N,
-        "bytes_hbm": bytes_total,
-        "arith_intensity": (2 * M * K * N) / bytes_total,  # flops/byte
-        "tile_count_expected": _ceil(M, TILE_M) * _ceil(N, TILE_N) * _ceil(K, TILE_K),
-        "sim_wall_clock_s": round(wall, 3),
-        "engines": {},
-    }
-    for eng in ENGINES:
-        row["engines"][eng] = {
-            "occupancy_ns": _engine_occupancy_ns(op_log, eng),
-            "wall_ns": _engine_wall_ns(op_log, eng),
-            "record_count": _engine_count(op_log, eng),
-        }
-    row["stages"] = {}
-    for stage in STAGES:
-        row["stages"][stage] = {
-            "occupancy_ns": _stage_occupancy_ns(op_log, stage),
-            "wall_ns": _stage_wall_ns(op_log, stage),
-            "record_count": _stage_count(op_log, stage),
-        }
-    # Kernel-window wall-clock = max t_end - min t_start over PE engine records.
-    pe_records = [r for r in op_log
-                  if any(r.component_id.endswith("." + e) for e in ENGINES)]
-    if pe_records:
-        row["pe_window_ns"] = max(r.t_end for r in pe_records) \
-            - min(r.t_start for r in pe_records)
-    else:
-        row["pe_window_ns"] = 0.0
-    stage_records = [r for r in op_log
-                     if r.params.get("stage_type") in STAGES]
-    if stage_records:
-        row["composite_window_ns"] = max(r.t_end for r in stage_records) \
-            - min(r.t_start for r in stage_records)
-    else:
-        row["composite_window_ns"] = 0.0
-    return row
-
-
-def _ceil(a: int, b: int) -> int:
-    return (a + b - 1) // b
+from kernbench.benches.milestone_1h_gemm import run_sweep


 def main() -> int:
-    shapes_env = os.environ.get("SWEEP_SHAPES")
-    raw = (shapes_env.split(",") if shapes_env else DEFAULT_SHAPES)
-    shapes: list[tuple[int, int, int]] = []
-    for s in raw:
-        s = s.strip()
-        if not s:
-            continue
-        if "x" in s.lower():
-            parts = s.lower().split("x")
-            shapes.append((int(parts[0]), int(parts[1]), int(parts[2])))
-        else:
-            v = int(s)
-            shapes.append((v, v, v))
-    topology = os.environ.get("SWEEP_TOPOLOGY", "topology.yaml")
-    rows = []
-    for M, K, N in shapes:
-        for variant in VARIANTS:
-            print(f"[sweep] M={M} K={K} N={N} variant={variant} ...", flush=True)
-            row = _run_one(M, K, N, topology, variant=variant)
-            rows.append(row)
-            eng_dma = row["engines"]["pe_dma"]
-            eng_gem = row["engines"]["pe_gemm"]
-            print(f" tiles={row['tile_count_expected']:>6} "
-                  f"pe_window={row['pe_window_ns']:8.1f}ns "
-                  f"dma_occ={eng_dma['occupancy_ns']:9.1f} "
-                  f"gemm_occ={eng_gem['occupancy_ns']:8.1f} "
-                  f"(sim {row['sim_wall_clock_s']:.1f}s)")
-    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
-    OUT_PATH.write_text(json.dumps({
-        "tile_sizes": {"M": TILE_M, "K": TILE_K, "N": TILE_N},
-        "engines": ENGINES,
-        "stages": STAGES,
-        "variants": VARIANTS,
-        "rows": rows,
-    }, indent=2))
-    print(f"\n[sweep] wrote {OUT_PATH}")
+    run_sweep()
     return 0
Binary file not shown. Size: 38 KiB
Binary file not shown. Size: 36 KiB

@@ -0,0 +1,13 @@
buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
hbm,torus_2d,6,128,256,2120.040000000012
hbm,torus_2d,6,1024,2048,2717.2783333333473
hbm,torus_2d,6,8192,16384,7315.184999999989
hbm,torus_2d,6,32768,65536,23081.26500000037
sram,torus_2d,6,128,256,2060.040000000012
sram,torus_2d,6,1024,2048,2909.2783333333473
sram,torus_2d,6,8192,16384,9523.184999999869
sram,torus_2d,6,32768,65536,32201.265000000385
tcm,torus_2d,6,128,256,1964.040000000012
tcm,torus_2d,6,1024,2048,2477.2783333333473
tcm,torus_2d,6,8192,16384,6403.185000000109
tcm,torus_2d,6,32768,65536,19865.265000000378
Binary file not shown. Size: 75 KiB
Binary file not shown. Size: 37 KiB
Binary file not shown. Size: 86 KiB

@@ -0,0 +1,37 @@
algorithm,sip_topology,n_sips,n_elem,bytes_per_pe,bytes_per_sip,latency_ns
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,8,16,256,2666.552500000015
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,32,64,1024,2747.7400000000152
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,64,128,2048,2855.990000000018
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,128,256,4096,3072.490000000019
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,512,1024,16384,3337.1133333333582
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3708.0333333333692
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,4449.873333333393
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,5933.020000000124
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,8900.379999999863
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,14835.099999999224
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,26704.540000000765
lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,49152,98304,1572864,38573.97999999701
lrab_hierarchical_allreduce,ring_1d,6,8,16,256,2365.255833333347
lrab_hierarchical_allreduce,ring_1d,6,32,64,1024,2436.9433333333473
lrab_hierarchical_allreduce,ring_1d,6,64,128,2048,2532.526666666683
lrab_hierarchical_allreduce,ring_1d,6,128,256,4096,2723.693333333349
lrab_hierarchical_allreduce,ring_1d,6,512,1024,16384,3048.635000000021
lrab_hierarchical_allreduce,ring_1d,6,1024,2048,32768,3393.4016666666957
lrab_hierarchical_allreduce,ring_1d,6,2048,4096,65536,4082.401666666714
lrab_hierarchical_allreduce,ring_1d,6,4096,8192,131072,5458.80166666677
lrab_hierarchical_allreduce,ring_1d,6,8192,16384,262144,8216.934999999943
lrab_hierarchical_allreduce,ring_1d,6,16384,32768,524288,13733.201666665835
lrab_hierarchical_allreduce,ring_1d,6,32768,65536,1048576,24765.73500000064
lrab_hierarchical_allreduce,ring_1d,6,49152,98304,1572864,35798.268333331536
lrab_hierarchical_allreduce,torus_2d,6,8,16,256,1700.6025000000095
lrab_hierarchical_allreduce,torus_2d,6,32,64,1024,1753.2900000000102
lrab_hierarchical_allreduce,torus_2d,6,64,128,2048,1823.540000000012
lrab_hierarchical_allreduce,torus_2d,6,128,256,4096,1964.040000000012
lrab_hierarchical_allreduce,torus_2d,6,512,1024,16384,2196.8183333333463
lrab_hierarchical_allreduce,torus_2d,6,1024,2048,32768,2477.2783333333473
lrab_hierarchical_allreduce,torus_2d,6,2048,4096,65536,3038.1983333333583
lrab_hierarchical_allreduce,torus_2d,6,4096,8192,131072,4159.5050000000665
lrab_hierarchical_allreduce,torus_2d,6,8192,16384,262144,6403.185000000109
lrab_hierarchical_allreduce,torus_2d,6,16384,32768,524288,10890.5449999995
lrab_hierarchical_allreduce,torus_2d,6,32768,65536,1048576,19865.265000000378
lrab_hierarchical_allreduce,torus_2d,6,49152,98304,1572864,28839.98500000059
Binary file not shown (image, 194 KiB).

Binary file not shown (image, 40 KiB).

Binary file not shown (image, 45 KiB).

Binary file not shown (image, 45 KiB).

File diff suppressed because it is too large.
+180
@@ -0,0 +1,180 @@
"""Mesh-native bidirectional Ring-K/V attention kernel — prefill (ADR-0059 Proposed).

Each rank holds its own Q tile and 1/n_ranks of K, V (sequence-sharded).
Over ``n_ranks - 1`` bidirectional steps, K and V propagate both east and
west: chunk c_i originating at rank i reaches rank j at step ``|i - j|``.
Every rank receives every other rank's chunk **exactly once** and folds it
into a running ``(m, ℓ, o)`` via the online-softmax recurrence. After all
steps each rank holds the final attention output for its own Q tokens —
no cross-rank merge is required.

Supersedes ADR-0055's closed-ring ``_attention_ring_kv.py``. Both modules
stay on disk during the transition; this one runs on the hardware's
actual open-mesh wiring (no closed-ring SFR install required).

Imported by ``milestone_gqa_llama70b`` (after the bench's Phase 2 switches
its imports) and invoked through ``torch.launch(...)`` — not through
``dist.all_reduce(...)``. See ADR-0055 Context for why this kernel is not
backend-dispatched via ADR-0050's algorithm-module contract.
"""
from __future__ import annotations

from kernbench.common.pe_commands import TensorHandle


def _view(handle: TensorHandle, new_shape: tuple[int, ...]) -> TensorHandle:
    """Reshape — metadata only, no command emitted (cf. ``tl.trans``)."""
    return TensorHandle(
        id=handle.id,
        addr=handle.addr,
        shape=new_shape,
        dtype=handle.dtype,
        nbytes=handle.nbytes,
        data=handle.data,
        space=handle.space,
        pinned=handle.pinned,
    )


def _partial_attention(
    Q: TensorHandle,
    K: TensorHandle,
    V: TensorHandle,
    S_q: int,
    S_kv_per_rank: int,
    h_q: int,
    d_head: int,
    tl,
) -> tuple[TensorHandle, TensorHandle, TensorHandle]:
    """One pass of partial attention against (K, V).

    Emits 1 GEMM(Q·K^T) + softmax + max + sub + exp + sum + 1 GEMM(P·V).
    Returns the running-statistics triplet ``(m, ℓ, O_partial)`` for the
    online-softmax mlo merge.
    """
    K_2d_T = _view(K, (h_q * d_head, S_kv_per_rank))
    V_2d = _view(V, (S_kv_per_rank, h_q * d_head))
    scores = tl.dot(Q, K_2d_T)
    m = tl.max(scores, axis=-1)
    P = tl.softmax(scores, axis=-1)
    scores_centered = scores - m
    exp_scores = tl.exp(scores_centered)
    ell = tl.sum(exp_scores, axis=-1)
    O_partial = tl.dot(P, V_2d)
    return m, ell, O_partial


def attention_mesh_kv_kernel(
    q_ptr: int,
    k_ptr: int,
    v_ptr: int,
    o_ptr: int,
    S_q: int,
    S_kv_per_rank: int,
    h_q: int,
    h_kv: int,
    d_head: int,
    n_ranks: int,
    rank_axis: int = 0,
    *,
    tl,
) -> None:
    """Mesh-native bidirectional Ring-K/V attention — see module docstring.

    ``rank_axis`` selects which program-id dimension carries the ring rank:
      0 — single_user_* panels: rank == tl.program_id(axis=0) (PE id in cube).
      1 — multi_user_* panels: ring is at the cube level. Only PE 0 in each
          cube participates; the other 7 hold KV replicas but stay silent.
    """
    # For multi_user (rank_axis=1) only PE 0 in each cube runs the ring.
    if rank_axis != 0 and tl.program_id(axis=0) != 0:
        return

    rank = tl.program_id(axis=rank_axis)
    has_E = rank < n_ranks - 1
    has_W = rank > 0

    # Q stays put on this rank — loaded once, used in every partial attention.
    Q = tl.load(q_ptr, shape=(S_q, h_q * d_head), dtype="f16")

    # Local K, V chunk.
    K = tl.load(k_ptr, shape=(S_kv_per_rank, h_kv, d_head), dtype="f16")
    V = tl.load(v_ptr, shape=(S_kv_per_rank, h_kv, d_head), dtype="f16")

    # Step 0 (local): partial attention against own K, V — initializes the
    # running triplet (m, ℓ, o).
    m, ell, o = _partial_attention(
        Q, K, V, S_q, S_kv_per_rank, h_q, d_head, tl,
    )

    # Seed bidirectional waves with own chunk (step-1 send).
    to_send_east_K: TensorHandle | None = K
    to_send_east_V: TensorHandle | None = V
    to_send_west_K: TensorHandle | None = K
    to_send_west_V: TensorHandle | None = V

    # Bidirectional fan-out: n_ranks - 1 steps. By step k, the wave from
    # rank i has reached rank (i ± k). After n_ranks - 1 steps, every rank
    # has merged every other rank's chunk exactly once (ADR-0059 D3).
    for step in range(1, n_ranks):
        # Send the eastbound wave we currently hold (own at step 1; forwarded
        # at later steps). ``None`` means we have no wave to forward this
        # direction this step (edge rank, or the wave already passed by).
        if has_E and to_send_east_K is not None:
            tl.send(dir="E", src=to_send_east_K)
            tl.send(dir="E", src=to_send_east_V)
        if has_W and to_send_west_K is not None:
            tl.send(dir="W", src=to_send_west_K)
            tl.send(dir="W", src=to_send_west_V)

        # Receive eastbound wave from W (carries chunk c_{rank - step}).
        K_from_W: TensorHandle | None = None
        V_from_W: TensorHandle | None = None
        if has_W and (rank - step) >= 0:
            K_from_W = tl.recv(
                dir="W", shape=(S_kv_per_rank, h_kv, d_head), dtype="f16",
            )
            V_from_W = tl.recv(
                dir="W", shape=(S_kv_per_rank, h_kv, d_head), dtype="f16",
            )
            m_new, ell_new, o_new = _partial_attention(
                Q, K_from_W, V_from_W, S_q, S_kv_per_rank, h_q, d_head, tl,
            )
            m_combined = tl.maximum(m, m_new)
            scale_old = tl.exp(m - m_combined)
            scale_new = tl.exp(m_new - m_combined)
            ell = ell * scale_old + ell_new * scale_new
            o = o * scale_old + o_new * scale_new
            m = m_combined

        # Receive westbound wave from E (carries chunk c_{rank + step}).
        K_from_E: TensorHandle | None = None
        V_from_E: TensorHandle | None = None
        if has_E and (rank + step) < n_ranks:
            K_from_E = tl.recv(
                dir="E", shape=(S_kv_per_rank, h_kv, d_head), dtype="f16",
            )
            V_from_E = tl.recv(
                dir="E", shape=(S_kv_per_rank, h_kv, d_head), dtype="f16",
            )
            m_new, ell_new, o_new = _partial_attention(
                Q, K_from_E, V_from_E, S_q, S_kv_per_rank, h_q, d_head, tl,
            )
            m_combined = tl.maximum(m, m_new)
            scale_old = tl.exp(m - m_combined)
            scale_new = tl.exp(m_new - m_combined)
            ell = ell * scale_old + ell_new * scale_new
            o = o * scale_old + o_new * scale_new
            m = m_combined

        # Forward what we received for next step. ``None`` propagates: if no
        # chunk arrived this step (out-of-bounds wave origin), there is
        # nothing to forward next step in that direction.
        to_send_east_K = K_from_W
        to_send_east_V = V_from_W
        to_send_west_K = K_from_E
        to_send_west_V = V_from_E

    # Final normalize: O := o / ℓ.
    O_final = o / ell
    tl.store(o_ptr, O_final)
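
Note on the fan-out schedule: the module docstring states that the chunk originating at rank i reaches rank j at step `|i - j|` and that every rank sees every other chunk exactly once. The following is a minimal sketch for checking that claim on the host, assuming a 1-D open line of ranks. `simulate_fanout` is a hypothetical helper, not part of the commit; it tracks only chunk origin ids and does not touch `tl.send` / `tl.recv`.

```python
def simulate_fanout(n_ranks):
    """Return arrivals[rank][origin] = step at which origin's chunk arrived.

    Hypothetical stand-in for the kernel's send/recv loop: each chunk is
    represented only by the id of the rank that produced it.
    """
    arrivals = {r: {} for r in range(n_ranks)}
    # Each rank seeds both waves with its own chunk (the step-1 send).
    send_east = {r: r for r in range(n_ranks)}
    send_west = {r: r for r in range(n_ranks)}

    for step in range(1, n_ranks):
        recv_from_w, recv_from_e = {}, {}
        for r in range(n_ranks):
            # Open mesh: rank 0 has no west neighbor, the last rank no east one.
            recv_from_w[r] = send_east[r - 1] if r > 0 else None
            recv_from_e[r] = send_west[r + 1] if r < n_ranks - 1 else None
        for r in range(n_ranks):
            for origin in (recv_from_w[r], recv_from_e[r]):
                if origin is None:
                    continue
                assert origin not in arrivals[r], "chunk delivered twice"
                arrivals[r][origin] = step
        # Forward exactly what was received; None means the wave has passed.
        send_east, send_west = recv_from_w, recv_from_e
    return arrivals


arrivals = simulate_fanout(4)
for rank in range(4):
    print(rank, sorted(arrivals[rank].items()))
```

Forwarding the received chunk unchanged, rather than anything derived from the local running state, is what keeps each origin's contribution counted once per rank.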
@@ -0,0 +1,151 @@
"""Mesh-native bidirectional AllReduce-mlo attention — decode (ADR-0059 Proposed).

Every rank holds the full Q (replicated, small at ``S_q=1``) and 1/n_ranks
of KV (sequence-sharded). Each rank computes its partial attention
against own KV in ONE shot, then runs a bidirectional fan-out of the
``(m, ℓ, o)`` triplet: the triplet originating at rank i reaches rank j at
step ``|i - j|``. Every rank merges every other rank's triplet exactly
once over ``n_ranks - 1`` steps, ending with the final answer replicated
on every rank.

Supersedes ADR-0056's closed-ring ``_attention_allreduce_mlo.py``. Both
modules stay on disk during the transition; this one runs on the
hardware's actual open-mesh wiring (no closed-ring SFR install required).

Imported by ``milestone_gqa_llama70b`` (after the bench's Phase 2 switches
its imports) and invoked through ``torch.launch(...)`` — not through
``dist.all_reduce(...)``. See ADR-0056 Context for why this kernel is not
backend-dispatched via ADR-0050's algorithm-module contract.
"""
from __future__ import annotations

from kernbench.common.pe_commands import TensorHandle


def _view(handle: TensorHandle, new_shape: tuple[int, ...]) -> TensorHandle:
    """Reshape — metadata only, no command emitted (cf. ``tl.trans``)."""
    return TensorHandle(
        id=handle.id,
        addr=handle.addr,
        shape=new_shape,
        dtype=handle.dtype,
        nbytes=handle.nbytes,
        data=handle.data,
        space=handle.space,
        pinned=handle.pinned,
    )


def attention_mesh_mlo_kernel(
    q_ptr: int,
    k_ptr: int,
    v_ptr: int,
    o_ptr: int,
    S_q: int,
    S_kv_per_rank: int,
    h_q: int,
    h_kv: int,
    d_head: int,
    n_ranks: int,
    rank_axis: int = 0,
    *,
    tl,
) -> None:
    """Mesh-native bidirectional AllReduce-mlo — see module docstring.

    ``rank_axis`` selects which program-id dimension carries the ring rank:
      0 — single_user_* panels: rank == tl.program_id(axis=0) (PE id in cube).
      1 — multi_user_* panels: ring is at the cube level. Only PE 0 in each
          cube participates; the other 7 hold KV replicas but stay silent.
    """
    # For multi_user (rank_axis=1) only PE 0 in each cube runs the ring.
    if rank_axis != 0 and tl.program_id(axis=0) != 0:
        return

    rank = tl.program_id(axis=rank_axis)
    has_E = rank < n_ranks - 1
    has_W = rank > 0

    # Q is replicated on every rank — loaded once.
    Q = tl.load(q_ptr, shape=(S_q, h_q * d_head), dtype="f16")

    # Local KV chunk. KV is sequence-sharded and stays put on this rank for
    # the entire fan-out — distinguishing decode from prefill (ADR-0059 D3)
    # where KV circulates.
    K = tl.load(k_ptr, shape=(S_kv_per_rank, h_kv, d_head), dtype="f16")
    V = tl.load(v_ptr, shape=(S_kv_per_rank, h_kv, d_head), dtype="f16")

    # ── One-shot local partial attention ──────────────────────────
    K_2d_T = _view(K, (h_q * d_head, S_kv_per_rank))
    V_2d = _view(V, (S_kv_per_rank, h_q * d_head))
    scores = tl.dot(Q, K_2d_T)
    m = tl.max(scores, axis=-1)
    P = tl.softmax(scores, axis=-1)
    scores_centered = scores - m
    exp_scores = tl.exp(scores_centered)
    ell = tl.sum(exp_scores, axis=-1)
    o = tl.dot(P, V_2d)

    # Seed bidirectional waves with own triplet (step-1 send).
    to_send_east_m: TensorHandle | None = m
    to_send_east_ell: TensorHandle | None = ell
    to_send_east_o: TensorHandle | None = o
    to_send_west_m: TensorHandle | None = m
    to_send_west_ell: TensorHandle | None = ell
    to_send_west_o: TensorHandle | None = o

    # Bidirectional fan-out of (m, ℓ, o) triplets — n_ranks - 1 steps.
    for step in range(1, n_ranks):
        # Send eastbound triplet (own at step 1; forwarded at later steps).
        if has_E and to_send_east_m is not None:
            tl.send(dir="E", src=to_send_east_m)
            tl.send(dir="E", src=to_send_east_ell)
            tl.send(dir="E", src=to_send_east_o)
        # Send westbound triplet.
        if has_W and to_send_west_m is not None:
            tl.send(dir="W", src=to_send_west_m)
            tl.send(dir="W", src=to_send_west_ell)
            tl.send(dir="W", src=to_send_west_o)

        # Receive eastbound triplet from W (originated at rank - step).
        m_from_W: TensorHandle | None = None
        ell_from_W: TensorHandle | None = None
        o_from_W: TensorHandle | None = None
        if has_W and (rank - step) >= 0:
            m_from_W = tl.recv(dir="W", shape=m.shape, dtype="f16")
            ell_from_W = tl.recv(dir="W", shape=ell.shape, dtype="f16")
            o_from_W = tl.recv(dir="W", shape=o.shape, dtype="f16")
            m_combined = tl.maximum(m, m_from_W)
            scale_old = tl.exp(m - m_combined)
            scale_new = tl.exp(m_from_W - m_combined)
            ell = ell * scale_old + ell_from_W * scale_new
            o = o * scale_old + o_from_W * scale_new
            m = m_combined

        # Receive westbound triplet from E (originated at rank + step).
        m_from_E: TensorHandle | None = None
        ell_from_E: TensorHandle | None = None
        o_from_E: TensorHandle | None = None
        if has_E and (rank + step) < n_ranks:
            m_from_E = tl.recv(dir="E", shape=m.shape, dtype="f16")
            ell_from_E = tl.recv(dir="E", shape=ell.shape, dtype="f16")
            o_from_E = tl.recv(dir="E", shape=o.shape, dtype="f16")
            m_combined = tl.maximum(m, m_from_E)
            scale_old = tl.exp(m - m_combined)
            scale_new = tl.exp(m_from_E - m_combined)
            ell = ell * scale_old + ell_from_E * scale_new
            o = o * scale_old + o_from_E * scale_new
            m = m_combined

        # Forward the original received triplet (not the merged running state)
        # so neighbors get the original wave. ``None`` propagates if nothing
        # arrived this step.
        to_send_east_m = m_from_W
        to_send_east_ell = ell_from_W
        to_send_east_o = o_from_W
        to_send_west_m = m_from_E
        to_send_west_ell = ell_from_E
        to_send_west_o = o_from_E

    # Final normalize: O := o / ℓ.
    O_final = o / ell
    tl.store(o_ptr, O_final)
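
Note on the `(m, ℓ, o)` merge: the recurrence used in both kernels (rescale each side by `exp(m_side - m_combined)`, then add) can be checked numerically on the host. The sketch below is a pure-Python illustration for a single query row, not the kernel itself. It assumes `o` is kept as the unnormalized weighted sum `Σ exp(s - m) · v`, which is the form the final `o / ℓ` division presumes. `partial_triplet`, `merge`, and `finalize` are hypothetical names introduced for this example.

```python
import math


def partial_triplet(scores, values):
    """Local (m, ell, o) for one query row over one KV shard.

    m   : max score in the shard
    ell : sum of exp(score - m)
    o   : unnormalized weighted sum of value rows (assumption, see lead-in)
    """
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    ell = sum(weights)
    dim = len(values[0])
    o = [sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)]
    return m, ell, o


def merge(a, b):
    """Combine two triplets with the online-softmax rescaling."""
    m_a, ell_a, o_a = a
    m_b, ell_b, o_b = b
    m = max(m_a, m_b)
    scale_a = math.exp(m_a - m)
    scale_b = math.exp(m_b - m)
    ell = ell_a * scale_a + ell_b * scale_b
    o = [x * scale_a + y * scale_b for x, y in zip(o_a, o_b)]
    return m, ell, o


def finalize(triplet):
    """Final normalize: O = o / ell."""
    _, ell, o = triplet
    return [x / ell for x in o]


# Four KV positions split across two "ranks" of two positions each.
scores = [0.5, 2.0, -1.0, 3.0]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5]]

full = finalize(partial_triplet(scores, values))
merged = finalize(merge(partial_triplet(scores[:2], values[:2]),
                        partial_triplet(scores[2:], values[2:])))
print(full, merged)
```

The merge is symmetric and associative up to floating-point rounding, which is why the order in which east and west waves arrive should not change the result beyond rounding.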
File diff suppressed because it is too large.
+568
@@ -0,0 +1,568 @@
"""milestone-1h-gemm bench: GEMM evaluation harness (sweep + figures).

Self-contained milestone bench (ADR-0054). Holds the shape×variant sweep
and the figure renderers; the ``run(torch)`` entry at the bottom runs the
sweep (or reuses the committed JSON when ``MILESTONE_FAST=1``) and writes
every figure into ``benches/1H_milestone_output/gemm/``.

This is the single home for the GEMM eval logic: the figure tests import a
thin re-export shim (``tests/gemm/_gemm_plot_helpers.py``), as does the
``scripts/gemm_sweep.py`` wrapper.

The sweep drives ``matmul-composite`` across shapes×variants through the
same ``run_bench`` path the CLI uses, harvests ``result.engine.op_log``,
and writes the sweep JSON. The renderers read that JSON and emit matplotlib
PNGs. No simulation in the renderers — they are fast.

Chart set (mirrors the GEMM MAC slides in scripts/build_overview_slides.py):
  - stage breakdown (load_ref operand staging)
  - MAC utilization — measured (load_ref)
  - MAC utilization — theoretical vs measured (load_ref)
"""
from __future__ import annotations

import json
import os
import sys
import time
from pathlib import Path

from kernbench.benches.registry import bench
from kernbench.policy.placement.dp import DPPolicy

ROOT = Path(__file__).resolve().parents[3]
DEFAULT_SWEEP_JSON = ROOT / "docs" / "diagrams" / "gemm_sweep.json"
DEFAULT_PLOTS_DIR = ROOT / "docs" / "diagrams" / "gemm_plots"
_OUTPUT_DIR = Path(__file__).resolve().parent / "1H_milestone_output" / "gemm"

# ── sweep configuration ────────────────────────────────────────────────
# Default sweep covering under-tile, single-tile, multi-tile, and asymmetric
# regimes. Each entry is "MxKxN" or a single int (square M=K=N).
# Override via env: SWEEP_SHAPES="16,32,16x2048x16,..."
DEFAULT_SHAPES = [
    "32x32x32",     # 1 tile, K=32 < TILE_K=64 → under-tile in K
    "32x64x32",     # 1 tile, exact single-tile fit
    "32x128x32",    # 2 tiles, aligned
    "32x128x128",   # 8 tiles, aligned
    "32x3072x32",   # 48 tiles, all K-axis (tall-skinny)
    "8x128x128",    # 8 tiles, but M=8 < TILE_M=32 → MAC array under-fed
    "128x8x128",    # 16 tiles, but K=8 < TILE_K=64 → MAC array under-fed
    "512",          # 2048 tiles, fully aligned — "well-pipelined" reference
]

# Operand-staging variants exercised per shape.
VARIANTS = ["ref_ref", "load_ref", "load_load"]

# Engines whose timings we collect (component_id suffix match).
ENGINES = ["pe_dma", "pe_fetch_store", "pe_gemm", "pe_math"]

# Per-stage breakdown labels (StageType enum names from pe_types.py).
STAGES = ["DMA_READ", "DMA_WRITE", "FETCH", "STORE", "GEMM", "MATH"]

# Scheduler tile sizes (mirror of PeSchedulerComponent.TILE_M/K/N).
TILE_M, TILE_K, TILE_N = 32, 64, 32


def _ceil(a: int, b: int) -> int:
    return (a + b - 1) // b


def _engine_wall_ns(records, suffix: str) -> float:
    """Wall-clock interval the engine was active (union of overlapping ops)."""
    intervals = [(r.t_start, r.t_end) for r in records
                 if r.component_id.endswith("." + suffix)]
    if not intervals:
        return 0.0
    intervals.sort()
    merged_end = intervals[0][1]
    merged_start = intervals[0][0]
    total = 0.0
    for s, e in intervals[1:]:
        if s <= merged_end:
            merged_end = max(merged_end, e)
        else:
            total += merged_end - merged_start
            merged_start, merged_end = s, e
    total += merged_end - merged_start
    return total


def _engine_occupancy_ns(records, suffix: str) -> float:
    return sum(r.t_end - r.t_start for r in records
               if r.component_id.endswith("." + suffix))


def _engine_count(records, suffix: str) -> int:
    return sum(1 for r in records if r.component_id.endswith("." + suffix))


def _stage_occupancy_ns(records, stage_type: str) -> float:
    return sum(
        r.t_end - r.t_start
        for r in records
        if r.params.get("stage_type") == stage_type
    )


def _stage_wall_ns(records, stage_type: str) -> float:
    """Interval-union wall-clock for records whose stage_type matches."""
    intervals = sorted(
        (r.t_start, r.t_end) for r in records
        if r.params.get("stage_type") == stage_type
    )
    if not intervals:
        return 0.0
    total = 0.0
    cs, ce = intervals[0]
    for s, e in intervals[1:]:
        if s <= ce:
            ce = max(ce, e)
        else:
            total += ce - cs
            cs, ce = s, e
    total += ce - cs
    return total


def _stage_count(records, stage_type: str) -> int:
    return sum(1 for r in records if r.params.get("stage_type") == stage_type)


def _run_one(M: int, K: int, N: int, topology: str, variant: str = "ref_ref") -> dict:
    os.environ["MATMUL_M"] = str(M)
    os.environ["MATMUL_K"] = str(K)
    os.environ["MATMUL_N"] = str(N)
    os.environ["MATMUL_VARIANT"] = variant

    # Late imports so env vars are read by matmul_composite at module load.
    # Force re-import to pick up new env values.
    for mod_name in [m for m in list(sys.modules)
                     if m.startswith("kernbench.benches.matmul_composite")]:
        del sys.modules[mod_name]

    from kernbench.benches.registry import resolve as resolve_bench
    from kernbench.runtime_api.bench_runner import run_bench
    from kernbench.runtime_api.types import resolve_device
    from kernbench.sim_engine.engine import GraphEngine
    from kernbench.topology.builder import resolve_topology

    topo = resolve_topology(topology)
    bench = resolve_bench("matmul-composite").run
    device = resolve_device(None)

    t0 = time.time()
    result = run_bench(
        topology=topo, bench_fn=bench, device=device,
        engine_factory=lambda t, d: GraphEngine(
            getattr(t, "topology_obj", t), enable_data=True,
        ),
    )
    wall = time.time() - t0
    op_log = result.engine.op_log
    if not result.completion.ok:
        raise RuntimeError(f"bench failed at M={M},K={K},N={N}: {result.completion}")

    # Bytes touched at f16 (2 B): full A + full B + full out (each operand
    # streamed once through HBM by the composite plan).
    bytes_total = (M * K + K * N + M * N) * 2

    row = {
        "M": M, "K": K, "N": N,
        "variant": variant,
        "flops": 2 * M * K * N,
        "bytes_hbm": bytes_total,
        "arith_intensity": (2 * M * K * N) / bytes_total,  # flops/byte
        "tile_count_expected": _ceil(M, TILE_M) * _ceil(N, TILE_N) * _ceil(K, TILE_K),
        "sim_wall_clock_s": round(wall, 3),
        "engines": {},
    }
    for eng in ENGINES:
        row["engines"][eng] = {
            "occupancy_ns": _engine_occupancy_ns(op_log, eng),
            "wall_ns": _engine_wall_ns(op_log, eng),
            "record_count": _engine_count(op_log, eng),
        }
    row["stages"] = {}
    for stage in STAGES:
        row["stages"][stage] = {
            "occupancy_ns": _stage_occupancy_ns(op_log, stage),
            "wall_ns": _stage_wall_ns(op_log, stage),
            "record_count": _stage_count(op_log, stage),
        }

    # Kernel-window wall-clock = max t_end - min t_start over PE engine records.
    pe_records = [r for r in op_log
                  if any(r.component_id.endswith("." + e) for e in ENGINES)]
    if pe_records:
        row["pe_window_ns"] = max(r.t_end for r in pe_records) \
            - min(r.t_start for r in pe_records)
    else:
        row["pe_window_ns"] = 0.0

    stage_records = [r for r in op_log
                     if r.params.get("stage_type") in STAGES]
    if stage_records:
        row["composite_window_ns"] = max(r.t_end for r in stage_records) \
            - min(r.t_start for r in stage_records)
    else:
        row["composite_window_ns"] = 0.0
    return row


def _parse_shapes(raw) -> list[tuple[int, int, int]]:
    shapes: list[tuple[int, int, int]] = []
    for s in raw:
        s = s.strip()
        if not s:
            continue
        if "x" in s.lower():
            parts = s.lower().split("x")
            shapes.append((int(parts[0]), int(parts[1]), int(parts[2])))
        else:
            v = int(s)
            shapes.append((v, v, v))
    return shapes


def run_sweep(out_json: Path | str = DEFAULT_SWEEP_JSON) -> Path:
    """Drive matmul-composite across shapes×variants; write the sweep JSON.

    Honors ``SWEEP_SHAPES`` / ``SWEEP_TOPOLOGY`` env overrides (same as the
    historical ``scripts/gemm_sweep.py``). Returns the JSON path written.
    """
    shapes_env = os.environ.get("SWEEP_SHAPES")
    raw = (shapes_env.split(",") if shapes_env else DEFAULT_SHAPES)
    shapes = _parse_shapes(raw)
    topology = os.environ.get("SWEEP_TOPOLOGY", "topology.yaml")

    rows = []
    for M, K, N in shapes:
        for variant in VARIANTS:
            print(f"[sweep] M={M} K={K} N={N} variant={variant} ...", flush=True)
            row = _run_one(M, K, N, topology, variant=variant)
            rows.append(row)
            eng_dma = row["engines"]["pe_dma"]
            eng_gem = row["engines"]["pe_gemm"]
            print(f" tiles={row['tile_count_expected']:>6} "
                  f"pe_window={row['pe_window_ns']:8.1f}ns "
                  f"dma_occ={eng_dma['occupancy_ns']:9.1f} "
                  f"gemm_occ={eng_gem['occupancy_ns']:8.1f} "
                  f"(sim {row['sim_wall_clock_s']:.1f}s)")

    out_json = Path(out_json)
    out_json.parent.mkdir(parents=True, exist_ok=True)
    out_json.write_text(json.dumps({
        "tile_sizes": {"M": TILE_M, "K": TILE_K, "N": TILE_N},
        "engines": ENGINES,
        "stages": STAGES,
        "variants": VARIANTS,
        "rows": rows,
    }, indent=2))
    print(f"\n[sweep] wrote {out_json}")
    return out_json


# ── figure rendering ───────────────────────────────────────────────────
# Shapes excluded from the figures (mirrors build_overview_slides).
EXCLUDED_SHAPES = {(512, 512, 512)}

# Stage bars shown (raw op_log stage_type keys) + display names + colors.
STAGE_KEYS = ["DMA_READ", "FETCH", "GEMM", "DMA_WRITE"]
STAGE_DISPLAY = {
    "DMA_READ": "DMA in",
    "FETCH": "Fetch",
    "GEMM": "GEMM",
    "DMA_WRITE": "DMA out",
}
STAGE_COLORS = {
    "DMA_READ": "#3B82F6",
    "FETCH": "#10B981",
    "GEMM": "#F59E0B",
    "DMA_WRITE": "#A855F7",
}

# MAC-utilization model constants (mirror build_overview_slides).
_HBM_GBS = 256.0
_BPE = 2
_T_STAGE = 16.0
_D_STAGES = 3
_PLOT_VARIANT = "load_ref"


def _load_sweep_data(sweep_json: Path | str = DEFAULT_SWEEP_JSON) -> dict:
    sweep_json = Path(sweep_json)
    if not sweep_json.exists():
        return {"rows": []}
    data = json.loads(sweep_json.read_text())
    data["rows"] = [
        r for r in data.get("rows", [])
        if (r["M"], r["K"], r["N"]) not in EXCLUDED_SHAPES
    ]
    return data


def _shape_label(r: dict) -> str:
    if r["M"] == r["K"] == r["N"]:
        return f"M=K=N={r['M']}"
    return f"M={r['M']} K={r['K']} N={r['N']}"


def _under_tile(M, K, N, tile_M, tile_K, tile_N) -> bool:
    return M < tile_M or K < tile_K or N < tile_N


def _xtick_labels(shape_labels, tile_counts, flagged) -> list[str]:
    out = []
    for lbl, tc, fl in zip(shape_labels, tile_counts, flagged):
        s = f"{lbl}\n({tc} tiles)"
        if fl:
            s += " *"
        out.append(s)
    return out


def _grouped_bar_png(
    out_name: str, *, out_dir: Path, title: str, subtitle: str | None,
    shape_labels, tile_counts, flagged, series: dict, colors: dict,
    y_label: str, threshold: float | None = None, footnote: str | None = None,
) -> str:
    """Render one grouped-bar chart to out_dir/out_name; return the path."""
    import matplotlib.pyplot as plt
    import numpy as np

    n_groups = len(shape_labels)
    n_series = max(1, len(series))
    x = np.arange(n_groups)
    width = 0.8 / n_series

    fig, ax = plt.subplots(figsize=(11, 6))
    for i, (name, vals) in enumerate(series.items()):
        offset = (i - (n_series - 1) / 2) * width
        ax.bar(x + offset, vals, width, label=name, color=colors.get(name))
    ax.set_xticks(x)
    ax.set_xticklabels(
        _xtick_labels(shape_labels, tile_counts, flagged), fontsize=8,
    )
    ax.set_ylabel(y_label)
    ax.set_title(title, fontsize=13, fontweight="bold")
    if subtitle:
        ax.text(0.5, 1.01, subtitle, transform=ax.transAxes, ha="center",
                va="bottom", fontsize=8, color="#475569")
    if threshold is not None:
        ax.axhline(threshold, ls="--", color="gray", lw=1.0)
    ax.legend(fontsize=8, loc="upper right")
    ax.grid(True, axis="y", alpha=0.3)

    caption = "* = under-tile shape (M<TILE_M, K<TILE_K, or N<TILE_N)"
    if footnote:
        caption = footnote + "\n" + caption
    fig.text(0.5, 0.01, caption, ha="center", fontsize=7, color="gray",
             wrap=True)
    fig.tight_layout(rect=(0, 0.05, 1, 1))

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out = out_dir / out_name
    fig.savefig(out, dpi=120)
    plt.close(fig)
    return str(out)


def emit_stage_breakdown(
    sweep_json: Path | str = DEFAULT_SWEEP_JSON,
    out_dir: Path | str = DEFAULT_PLOTS_DIR,
) -> str | None:
    """Per-stage engine wall-clock per shape (load_ref operand staging)."""
    data = _load_sweep_data(sweep_json)
    rows = [r for r in data["rows"] if r.get("variant") == _PLOT_VARIANT]
    if not rows:
        return None
    tile = data["tile_sizes"]
    shape_labels = [_shape_label(r) for r in rows]
    flagged = [_under_tile(r["M"], r["K"], r["N"], tile["M"], tile["K"], tile["N"])
               for r in rows]
    tile_counts = [r["tile_count_expected"] for r in rows]
    series = {
        STAGE_DISPLAY[s]: [r.get("stages", {}).get(s, {}).get("wall_ns", 0.0)
                           for r in rows]
        for s in STAGE_KEYS
    }
    colors = {STAGE_DISPLAY[s]: STAGE_COLORS[s] for s in STAGE_KEYS}
    return _grouped_bar_png(
        "gemm_stage_breakdown.png", out_dir=Path(out_dir),
        title="GEMM stage breakdown",
        subtitle=(f"Per-stage engine wall-clock (DMA in / Fetch / GEMM / "
                  f"DMA out), {_PLOT_VARIANT} staging. "
                  f"Tile {tile['M']}x{tile['K']}x{tile['N']}."),
        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
        series=series, colors=colors, y_label="ns",
        footnote="Bars = engine wall-clock interval (merged overlaps).",
    )


def emit_mac_utilization_measured(
    sweep_json: Path | str = DEFAULT_SWEEP_JSON,
    out_dir: Path | str = DEFAULT_PLOTS_DIR,
) -> str | None:
    """GEMM util % and useful pipeline-eff % (analytical model, load_ref)."""
    data = _load_sweep_data(sweep_json)
    rows = data["rows"]
    if not rows:
        return None
    tile = data["tile_sizes"]
    TILE_M, TILE_K, TILE_N = tile["M"], tile["K"], tile["N"]
    tile_flops = 2 * TILE_M * TILE_K * TILE_N
    dma_w_per_pair = (TILE_M * TILE_N * _BPE) / _HBM_GBS
    head_ns = (_D_STAGES - 1) * _T_STAGE

    by_shape = {(r["M"], r["K"], r["N"]): r
                for r in rows if r["variant"] == _PLOT_VARIANT}
    shapes = list(by_shape)
    if not shapes:
        return None
    shape_labels = [_shape_label(by_shape[k]) for k in shapes]
    flagged = [_under_tile(*k, TILE_M, TILE_K, TILE_N) for k in shapes]
    tile_counts = [by_shape[k]["tile_count_expected"] for k in shapes]

    gemm_util, useful_eff = [], []
    for k in shapes:
        r = by_shape[k]
        M, K, N = r["M"], r["K"], r["N"]
        useful = 2 * M * K * N
        tiles = r["tile_count_expected"]
        gu = useful / (tile_flops * tiles) * 100
        gemm_util.append(gu)
        m_tiles = (M + TILE_M - 1) // TILE_M
        n_tiles = (N + TILE_N - 1) // TILE_N
        n_mn = m_tiles * n_tiles
        compute_total = tiles * _T_STAGE
        wall = head_ns + tiles * _T_STAGE + max(0, n_mn - 1) * dma_w_per_pair
        ueff = (compute_total * (gu / 100.0) / wall) * 100 if wall > 0 else 0.0
        useful_eff.append(ueff)

    series = {"GEMM util %": gemm_util, "Useful eff %": useful_eff}
    colors = {"GEMM util %": "#10B981", "Useful eff %": "#F59E0B"}
    return _grouped_bar_png(
        "gemm_mac_utilization_measured.png", out_dir=Path(out_dir),
        title="GEMM MAC utilization — load_ref",
        subtitle=("GEMM util = useful FLOPs / (tile FLOPs x tiles); "
                  "Useful eff = GEMM util x ideal pipeline efficiency."),
        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
        series=series, colors=colors, y_label="%", threshold=100.0,
        footnote="Theoretical ideal-pipeline model (not simulator data).",
    )


def emit_mac_utilization_theoretical_vs_measured(
    sweep_json: Path | str = DEFAULT_SWEEP_JSON,
    out_dir: Path | str = DEFAULT_PLOTS_DIR,
) -> str | None:
    """Theoretical vs simulator-measured GEMM util / useful eff (load_ref)."""
    data = _load_sweep_data(sweep_json)
    rows = data["rows"]
    if not rows:
        return None
    tile = data["tile_sizes"]
    TILE_M, TILE_K, TILE_N = tile["M"], tile["K"], tile["N"]
    tile_flops = 2 * TILE_M * TILE_K * TILE_N
    dma_w_per_pair = (TILE_M * TILE_N * _BPE) / _HBM_GBS
    head_ns = (_D_STAGES - 1) * _T_STAGE
    peak_per_ns = tile_flops / _T_STAGE

    by_shape = {(r["M"], r["K"], r["N"]): r
                for r in rows if r["variant"] == _PLOT_VARIANT}
    shapes = list(by_shape)
    if not shapes:
        return None
    shape_labels = [_shape_label(by_shape[k]) for k in shapes]
    flagged = [_under_tile(*k, TILE_M, TILE_K, TILE_N) for k in shapes]
    tile_counts = [by_shape[k]["tile_count_expected"] for k in shapes]

    gu_t, gu_m, eff_t, eff_m = [], [], [], []
    for k in shapes:
        r = by_shape[k]
        M, K, N = r["M"], r["K"], r["N"]
        useful = 2 * M * K * N
        tiles = r["tile_count_expected"]
        gut = useful / (tile_flops * tiles)
        gu_t.append(gut * 100)
        rec = r.get("stages", {}).get("GEMM", {}).get("record_count", 0) or tiles
        gu_m.append((useful / (tile_flops * rec) * 100) if rec else 0.0)
        m_tiles = (M + TILE_M - 1) // TILE_M
        n_tiles = (N + TILE_N - 1) // TILE_N
        n_mn = m_tiles * n_tiles
        compute_total = tiles * _T_STAGE
        wall_t = head_ns + compute_total + max(0, n_mn - 1) * dma_w_per_pair
        eff_t.append((compute_total * gut / wall_t * 100) if wall_t > 0 else 0.0)
        cw = r.get("composite_window_ns", 0.0) or 0.0
        eff_m.append((useful / cw / peak_per_ns * 100) if cw > 0 else 0.0)

    series = {
        "GEMM util % (theoretical)": gu_t,
        "GEMM util % (measured)": gu_m,
        "Theoretical eff %": eff_t,
        "Measured eff %": eff_m,
    }
    colors = {
        "GEMM util % (theoretical)": "#10B981",
        "GEMM util % (measured)": "#6EE7B7",
        "Theoretical eff %": "#F59E0B",
        "Measured eff %": "#3B82F6",
    }
    return _grouped_bar_png(
        "gemm_mac_utilization_theoretical_vs_measured.png", out_dir=Path(out_dir),
        title="GEMM MAC utilization — theoretical vs measured (load_ref)",
        subtitle=("theoretical model vs simulator op_log; agreement "
                  "validates the analytical pipeline model."),
        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
        series=series, colors=colors, y_label="%", threshold=100.0,
    )
def emit_all_gemm_plots(
sweep_json: Path | str = DEFAULT_SWEEP_JSON,
out_dir: Path | str = DEFAULT_PLOTS_DIR,
) -> list[str]:
"""Render every GEMM figure that has data; return the paths written."""
paths = []
for fn in (emit_stage_breakdown,
emit_mac_utilization_measured,
emit_mac_utilization_theoretical_vs_measured):
p = fn(sweep_json, out_dir)
if p:
paths.append(p)
return paths
# ── bench entry ────────────────────────────────────────────────────────
@bench(
name="milestone-1h-gemm",
description="1H milestone: regenerate all GEMM results + figures.",
)
def run(torch) -> None:
"""Run the GEMM sweep (or reuse committed JSON) and render every figure.
``MILESTONE_FAST=1`` reuses the committed ``DEFAULT_SWEEP_JSON`` (seconds);
otherwise the full sweep runs into ``out_dir/gemm_sweep.json`` (minutes).
The sweep drives its own engines, so a sentinel tensor is submitted at the
end to satisfy the run_bench contract (ADR-0045 D4).
"""
_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
fast = bool(os.environ.get("MILESTONE_FAST"))
if fast:
sweep_json = DEFAULT_SWEEP_JSON
else:
sweep_json = run_sweep(out_json=_OUTPUT_DIR / "gemm_sweep.json")
paths = emit_all_gemm_plots(sweep_json=sweep_json, out_dir=_OUTPUT_DIR)
print(f" milestone-1h-gemm: {len(paths)} figures -> {_OUTPUT_DIR} "
f"(fast={fast})")
torch.zeros(
(1, 1), dtype="f16",
dp=DPPolicy(cube="row_wise", pe="replicate", num_cubes=1, num_pes=1),
name="milestone_gemm_sentinel",
)
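The theoretical series computed in `emit_mac_utilization_theoretical_vs_measured` can be checked by hand with a small standalone sketch. The constants are copied from the diff; the function name `theoretical_gemm_metrics`, the 64x64x64 tile size, and the ceil-based tile count (standing in for the sweep's `tile_count_expected`) are assumptions made only for illustration.

```python
import math

# Model constants as they appear in the diff (mirroring build_overview_slides).
HBM_GBS = 256.0   # treated as bytes per ns in the wall-clock term
BPE = 2           # bytes per element (f16)
T_STAGE = 16.0    # ns per pipeline stage
D_STAGES = 3      # pipeline depth


def theoretical_gemm_metrics(M, K, N, tile_m, tile_k, tile_n):
    """Return (gemm_util_pct, theoretical_eff_pct) for one GEMM shape.

    Assumption: the tile count is ceil(M/tm) * ceil(K/tk) * ceil(N/tn);
    the real sweep reads it from the JSON row instead.
    """
    m_tiles = math.ceil(M / tile_m)
    k_tiles = math.ceil(K / tile_k)
    n_tiles = math.ceil(N / tile_n)
    tiles = m_tiles * k_tiles * n_tiles

    tile_flops = 2 * tile_m * tile_k * tile_n
    useful = 2 * M * K * N
    gut = useful / (tile_flops * tiles)          # fraction of MAC work that is useful

    head_ns = (D_STAGES - 1) * T_STAGE           # pipeline fill
    dma_w_per_pair = (tile_m * tile_n * BPE) / HBM_GBS
    compute_total = tiles * T_STAGE
    wall = head_ns + compute_total + max(0, m_tiles * n_tiles - 1) * dma_w_per_pair
    eff = compute_total * gut / wall if wall > 0 else 0.0
    return gut * 100, eff * 100


full_tile = theoretical_gemm_metrics(64, 64, 64, 64, 64, 64)
half_tile = theoretical_gemm_metrics(32, 64, 64, 64, 64, 64)
```

Under these assumptions a single full tile is 100% utilized but spends two of its three stage slots filling the pipeline, while an under-tile shape (M smaller than the tile) halves both numbers.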
+23 -7
@@ -609,6 +609,7 @@ class RuntimeContext:
         kernel_fn: Any,
         *args: Any,
         _defer_wait: bool = False,
+        _auto_dim_remap: bool = True,
         **kwargs: Any,
     ) -> RequestHandle:
         """Register and launch a kernel (like a fused torch op).
@@ -700,21 +701,36 @@ class RuntimeContext:
                 return t.shape
             # ADR-0026: DPPolicy no longer crosses SIP boundaries; cube + PE
             # are the only axes that shrink the local shape.
+            # Mirror the tensor allocator's precedence (context.py L471-484):
+            # DPPolicy.num_cubes overrides the topology's cube count when set.
+            # Without this, multi_user panels at validation scale
+            # (DPPolicy.num_cubes=4) get sharded as if the topology's full
+            # cube count (16) applied — see test_launch_dim_translation.py.
             if dp.cube != "replicate":
+                eff_num_cubes = (
+                    dp.num_cubes if dp.num_cubes is not None else self._num_cubes
+                )
                 if dp.cube == "column_wise":
-                    K = K // self._num_cubes
+                    K = K // eff_num_cubes
                 elif dp.cube == "row_wise":
-                    M = M // self._num_cubes
+                    M = M // eff_num_cubes
             if len(t.shape) < 2:
                 return (K,)
             return (M, K)

+        # Auto-dim-remap (opt-out via _auto_dim_remap=False). Legacy
+        # kernels (e.g. va_offset bench) pass global dims as scalars and
+        # rely on launch to rewrite them to local. Mesh attention kernels
+        # already receive cube-local dims (S_kv_per_rank, d_head, …) and
+        # opt out — the remap would otherwise collide d_head=64 with K's
+        # global M=64 and rewrite d_head. See test_launch_dim_translation.py.
         dim_map: dict[int, int] = {}  # global_dim → local_dim
-        for t in tensor_args:
-            local = _compute_local_shape(t)
-            for g, l in zip(t.shape if len(t.shape) >= 2 else (1, t.shape[0]), local if len(local) >= 2 else (1, local[0])):
-                if g != l:
-                    dim_map[g] = l
+        if _auto_dim_remap:
+            for t in tensor_args:
+                local = _compute_local_shape(t)
+                for g, l in zip(t.shape if len(t.shape) >= 2 else (1, t.shape[0]), local if len(local) >= 2 else (1, local[0])):
+                    if g != l:
+                        dim_map[g] = l

         # Per-SIP kernel launch: each SIP gets TensorArgs with local va_base
         last_handle = None
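The two `ctx.launch` bugs addressed in this file can be reproduced in isolation. The following is a minimal sketch, not the runtime's code: the helper names (`local_rows`, `build_dim_map`, `remap_scalars`) are hypothetical, and it assumes that launch rewrites scalar arguments by looking up their value in `dim_map`, as the comment in the diff describes.

```python
def local_rows(global_rows, policy_num_cubes, topology_num_cubes):
    """Row-wise cube sharding with the DPPolicy-first precedence from the fix."""
    eff = policy_num_cubes if policy_num_cubes is not None else topology_num_cubes
    return global_rows // eff


def build_dim_map(shape_pairs):
    """Value-keyed global -> local map, built the way the diff builds dim_map."""
    dim_map = {}
    for global_shape, local_shape in shape_pairs:
        for g, l in zip(global_shape, local_shape):
            if g != l:
                dim_map[g] = l
    return dim_map


def remap_scalars(scalars, dim_map):
    """Assumed remap step: any scalar whose value matches a global dim is rewritten."""
    return [dim_map.get(s, s) for s in scalars]


# Bug A: the policy asks for 4 cubes, the topology has 16.
rows_fixed = local_rows(64, 4, 16)      # sharded over the policy's 4 cubes
rows_legacy = local_rows(64, None, 16)  # falls back to the topology count

# Bug B: K is (S_kv_per_rank * n, d_head) = (64, 64), row-wise over 4 cubes.
dim_map = build_dim_map([((64, 64), (rows_fixed, 64))])
# Scalars: S_q, S_kv_per_rank, h_q, h_kv, d_head, n
remapped = remap_scalars([16, 16, 1, 1, 64, 4], dim_map)
```

In this sketch `d_head=64` collides with K's global row count of 64 and is rewritten to 16, which is the collision that `_auto_dim_remap=False` lets the mesh attention kernels avoid.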
+61 -12
@@ -25,6 +25,37 @@ class DataExecutor:
     def __init__(self, op_log: list[OpRecord], store: MemoryStore) -> None:
         self._op_log = op_log
         self.store = store
+        # Per-slot time-ordered shape-keyed history. Populated on every
+        # ipcq_copy WRITE; consulted on reads that find a shape-mismatched
+        # value in MemoryStore (the slot was wrapped by a later inbound
+        # before this read's Phase 2 turn). Required because Phase 1 cannot
+        # snapshot math-output sources at outbound time (math executes only
+        # in Phase 2), so token.data is None and slot wraps lose the recv-
+        # time value. See test_attention_mesh_decode_diag (ADR-0059 mesh).
+        self._slot_history: dict[tuple[str, int], list[tuple[float, Any]]] = {}
+
+    def _resolve_read(
+        self, space: str, addr: int,
+        shape: tuple[int, ...] | None, dtype: str | None,
+        t_at_or_before: float,
+    ) -> Any:
+        """Read (space, addr) with expected shape. On KeyError or shape
+        mismatch in MemoryStore, fall back to ``_slot_history`` for the
+        most recent shape-matching entry with t_write <= t_at_or_before.
+        Returns None when no match is found."""
+        try:
+            return self.store.read(space, addr, shape=shape, dtype=dtype)
+        except (KeyError, ValueError):
+            pass
+        hist = self._slot_history.get((space, addr))
+        if hist is None:
+            return None
+        for t_w, val in reversed(hist):
+            if t_w > t_at_or_before:
+                continue
+            if shape is None or getattr(val, "shape", None) == shape:
+                return val
+        return None

     # Ordering priority within the same t_start: memory copies must run
     # before math/gemm so that slot data is populated before a consumer
@@ -87,14 +118,23 @@ class DataExecutor:
         # only get populated by Phase 2's math replay).
         data = p.get("snapshot")
         if data is None:
-            try:
-                data = self.store.read(
-                    src_space, src_addr,
-                    shape=p.get("shape"), dtype=p.get("dtype"),
-                )
-            except KeyError:
+            data = self._resolve_read(
+                src_space, src_addr,
+                p.get("shape"), p.get("dtype"), op.t_start,
+            )
+            if data is None:
                 return
         self.store.write(dst_space, dst_addr, data)
+        # Record this write in slot history so a later forwarded read
+        # at src=dst_addr (a different ipcq_copy whose src is this slot)
+        # can recover by shape even after the slot has been wrapped.
+        if op.op_name == "ipcq_copy":
+            self._slot_history.setdefault(
+                (dst_space, dst_addr), [],
+            ).append((
+                op.t_start,
+                data.copy() if hasattr(data, "copy") else data,
+            ))

     def _execute_gemm(self, op: OpRecord) -> None:
         """Execute GEMM: out = a @ b."""
@@ -110,10 +150,16 @@ class DataExecutor:
         dtype_in = p.get("dtype_in", "f16")
         dtype_out = p.get("dtype_out", dtype_in)

-        a = self.store.read(src_a_space, p["src_a_addr"],
-                            shape=p.get("shape_a"), dtype=dtype_in)
-        b = self.store.read(src_b_space, p["src_b_addr"],
-                            shape=p.get("shape_b"), dtype=dtype_in)
+        a = self._resolve_read(src_a_space, p["src_a_addr"],
+                               p.get("shape_a"), dtype_in, op.t_start)
+        if a is None:
+            a = self.store.read(src_a_space, p["src_a_addr"],
+                                shape=p.get("shape_a"), dtype=dtype_in)
+        b = self._resolve_read(src_b_space, p["src_b_addr"],
+                               p.get("shape_b"), dtype_in, op.t_start)
+        if b is None:
+            b = self.store.read(src_b_space, p["src_b_addr"],
+                                shape=p.get("shape_b"), dtype=dtype_in)

         # Compute in higher precision if specified
         dtype_acc = p.get("dtype_acc", "f32")
@@ -150,8 +196,11 @@ class DataExecutor:
         ):
             if snap is not None:
                 inputs.append(snap)
-            else:
-                inputs.append(self.store.read(space, addr, shape=shape, dtype=idtype))
+                continue
+            resolved = self._resolve_read(space, addr, shape, idtype, op.t_start)
+            if resolved is None:
+                resolved = self.store.read(space, addr, shape=shape, dtype=idtype)
+            inputs.append(resolved)

         result = _compute_math(math_op, inputs, p.get("axis"))
         if result is not None:
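The fallback lookup added in `_resolve_read` can be illustrated without the simulator. The sketch below is a stand-in under stated assumptions: `Block` and `SlotHistory` are hypothetical types invented for the example, and only the history search (newest first, not newer than the read time, shape must match) is modeled, not the `MemoryStore` read that precedes it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for an array: only the shape matters here."""
    shape: tuple
    tag: str


class SlotHistory:
    """Per-slot, time-ordered write history keyed by (space, addr)."""

    def __init__(self):
        self._hist = {}

    def record(self, space, addr, t_write, value):
        # Appends are assumed to arrive in non-decreasing t_write order.
        self._hist.setdefault((space, addr), []).append((t_write, value))

    def lookup(self, space, addr, shape, t_at_or_before):
        # Walk newest to oldest; skip writes from the reader's future and
        # return the first entry whose shape matches the expected one.
        for t_w, val in reversed(self._hist.get((space, addr), [])):
            if t_w > t_at_or_before:
                continue
            if shape is None or getattr(val, "shape", None) == shape:
                return val
        return None


hist = SlotHistory()
hist.record("tcm", 0x100, 1.0, Block((16, 64), "kv_block"))
hist.record("tcm", 0x100, 2.0, Block((1, 64), "mlo_partial"))  # slot wrapped

recovered = hist.lookup("tcm", 0x100, (16, 64), t_at_or_before=3.0)
too_early = hist.lookup("tcm", 0x100, (16, 64), t_at_or_before=0.5)
```

Here the slot's current content has the wrong shape for the reader, so the lookup skips the later write and recovers the earlier block; a read timestamped before any write finds nothing.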
+11 -2
@@ -96,13 +96,20 @@ class OpLogger:
         # gets reused on the next ring round).
         if self._memory_store is not None:
             if op_kind == "math":
+                handle_snaps = params.get("input_handle_data") or ()
                 snaps: list[Any] = []
-                for addr, shape, space, idtype in zip(
+                for i, (addr, shape, space, idtype) in enumerate(zip(
                     params.get("input_addrs", []),
                     params.get("input_shapes", []),
                     params.get("input_spaces", []),
                     params.get("input_dtypes", []),
-                ):
+                )):
+                    if i < len(handle_snaps) and handle_snaps[i] is not None:
+                        carried = handle_snaps[i]
+                        snaps.append(
+                            carried.copy() if hasattr(carried, "copy") else carried
+                        )
+                        continue
                     try:
                         arr = self._memory_store.read(
                             space, addr, shape=shape, dtype=idtype,
@@ -111,6 +118,7 @@ class OpLogger:
                     except Exception:
                         snaps.append(None)
                 params["input_snapshots"] = snaps
+                params.pop("input_handle_data", None)
             elif op_name == "dma_write":
                 # ADR-0027 fix: only snapshot HBM sources. TCM (PE scratch)
                 # sources are repopulated by Phase 2 math/gemm replay —
@@ -222,6 +230,7 @@ def _extract_op_info(msg: Any) -> tuple[str, str, dict[str, Any]]:
             "input_shapes": [h.shape for h in msg.inputs],
             "input_spaces": [getattr(h, "space", "tcm") for h in msg.inputs],
             "input_dtypes": [h.dtype for h in msg.inputs],
+            "input_handle_data": tuple(getattr(h, "data", None) for h in msg.inputs),
             "dst_addr": msg.out.addr,
             "dst_space": getattr(msg.out, "space", "tcm"),
             "shape_out": msg.out.shape,
@@ -0,0 +1,198 @@
"""End-to-end engine drives for the four GQA Llama-70B panels (sub-cycle 4c step 2).
Mirrors the existing single_user_decode diag harness across all four panels
of the milestone-gqa-llama70b sweep (ADR-0057):
single_user_prefill ring-K/V kernel, intracube PE ring (8 PEs / 1 cube)
single_user_decode allreduce-mlo kernel, intracube PE ring
multi_user_prefill ring-K/V kernel, intercube multisip (4 cubes)
multi_user_decode allreduce-mlo kernel, intercube multisip
Each test runs the panel through ``run_bench`` with ``enable_data=True``
and asserts ``result.completion.ok``. Failures dump the engine's op_log
tail and the exception, mirroring the decode-diag harness format.
Validation-scale config matches ADR-0057 D4:
S_q_prefill=16, S_kv_per_rank=16, h_q=h_kv=1, d_head=64
n_ranks_single_user=8, n_ranks_multi_user=4
"""
from __future__ import annotations
import traceback
from pathlib import Path
import pytest
from kernbench.benches._attention_mesh_kv import attention_mesh_kv_kernel
from kernbench.benches._attention_mesh_mlo import attention_mesh_mlo_kernel
from kernbench.ccl.install import load_ccl_config, resolve_algorithm_config
from kernbench.ccl.sfr_config import (
configure_sfr_intercube_multisip,
configure_sfr_intracube_pe_ring,
)
from kernbench.policy.placement.dp import DPPolicy
from kernbench.runtime_api.bench_runner import run_bench
from kernbench.runtime_api.types import resolve_device
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import resolve_topology
TOPOLOGY_PATH = Path(__file__).resolve().parents[2] / "topology.yaml"
S_Q_PREFILL = 16
S_Q_DECODE = 1
S_KV_PER_RANK = 16
H_Q = 1
H_KV = 1
D_HEAD = 64
N_RANKS_SINGLE_USER = 8
N_RANKS_MULTI_USER = 4
DTYPE = "f16"
# ── Helpers ──────────────────────────────────────────────────────
def _engine_factory(t, d):
return GraphEngine(getattr(t, "topology_obj", t), enable_data=True)
def _run_panel(bench_fn):
"""Drive a panel through run_bench; return (exc, result, engine)."""
topo = resolve_topology(str(TOPOLOGY_PATH))
captured: dict = {"engine": None}
def factory(t, d):
eng = _engine_factory(t, d)
captured["engine"] = eng
return eng
exc = None
result = None
try:
result = run_bench(
topology=topo, bench_fn=bench_fn,
device=resolve_device(None), engine_factory=factory,
)
except BaseException as e: # noqa: BLE001
exc = e
return exc, result, captured["engine"]
def _assert_ok(name: str, exc, result, engine) -> None:
if exc is not None:
oplog_len = len(getattr(engine, "op_log", []) or []) if engine else 0
print(f"\n========== {name} FAIL ==========")
print(f"op_log records before crash: {oplog_len}")
print(f"{type(exc).__name__}: {exc}")
traceback.print_exception(type(exc), exc, exc.__traceback__)
raise AssertionError(
f"{name} failed at runtime: {exc}"
) from exc
assert result is not None, f"{name}: no result"
assert result.completion.ok, f"{name}: completion not ok — {result.completion}"
# ── Panel bench fns ──────────────────────────────────────────────
def _bench_fn_single_user_prefill(ctx):
configure_sfr_intracube_pe_ring(
ctx.engine, ctx.spec,
resolve_algorithm_config(load_ccl_config(), name="lrab_hierarchical_allreduce"),
)
n = N_RANKS_SINGLE_USER
dp_full = DPPolicy(cube="replicate", pe="replicate", num_cubes=1, num_pes=n)
dp_kv = DPPolicy(cube="replicate", pe="row_wise", num_cubes=1, num_pes=n)
q = ctx.zeros((S_Q_PREFILL, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="q")
k = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="k")
v = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="v")
o = ctx.empty((S_Q_PREFILL, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="o")
ctx.launch(
"single_user_prefill_mesh", attention_mesh_kv_kernel,
q, k, v, o,
S_Q_PREFILL, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, n,
)
def _bench_fn_single_user_decode(ctx):
configure_sfr_intracube_pe_ring(
ctx.engine, ctx.spec,
resolve_algorithm_config(load_ccl_config(), name="lrab_hierarchical_allreduce"),
)
n = N_RANKS_SINGLE_USER
dp_full = DPPolicy(cube="replicate", pe="replicate", num_cubes=1, num_pes=n)
dp_kv = DPPolicy(cube="replicate", pe="row_wise", num_cubes=1, num_pes=n)
q = ctx.zeros((S_Q_DECODE, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="q")
k = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="k")
v = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="v")
o = ctx.empty((S_Q_DECODE, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="o")
ctx.launch(
"single_user_decode_mesh", attention_mesh_mlo_kernel,
q, k, v, o,
S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, n,
)
def _bench_fn_multi_user_prefill(ctx):
configure_sfr_intercube_multisip(
ctx.engine, ctx.spec,
resolve_algorithm_config(load_ccl_config(), name="lrab_hierarchical_allreduce"),
)
n = N_RANKS_MULTI_USER
dp_full = DPPolicy(cube="replicate", pe="replicate", num_cubes=n, num_pes=8)
dp_kv = DPPolicy(cube="row_wise", pe="replicate", num_cubes=n, num_pes=8)
q = ctx.zeros((S_Q_PREFILL, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="q")
k = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="k")
v = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="v")
o = ctx.empty((S_Q_PREFILL, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="o")
ctx.launch(
"multi_user_prefill_mesh", attention_mesh_kv_kernel,
q, k, v, o,
S_Q_PREFILL, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, n,
1, # rank_axis=1 → ring at cube level (ADR-0059 multi_user)
_auto_dim_remap=False,
)
def _bench_fn_multi_user_decode(ctx):
configure_sfr_intercube_multisip(
ctx.engine, ctx.spec,
resolve_algorithm_config(load_ccl_config(), name="lrab_hierarchical_allreduce"),
)
n = N_RANKS_MULTI_USER
dp_full = DPPolicy(cube="replicate", pe="replicate", num_cubes=n, num_pes=8)
dp_kv = DPPolicy(cube="row_wise", pe="replicate", num_cubes=n, num_pes=8)
q = ctx.zeros((S_Q_DECODE, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="q")
k = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="k")
v = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="v")
o = ctx.empty((S_Q_DECODE, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="o")
ctx.launch(
"multi_user_decode_mesh", attention_mesh_mlo_kernel,
q, k, v, o,
S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, n,
1, # rank_axis=1 → ring at cube level (ADR-0059 multi_user)
_auto_dim_remap=False,
)
# ── Tests ────────────────────────────────────────────────────────
def test_single_user_prefill_through_engine():
exc, result, engine = _run_panel(_bench_fn_single_user_prefill)
_assert_ok("single_user_prefill", exc, result, engine)
def test_single_user_decode_through_engine():
exc, result, engine = _run_panel(_bench_fn_single_user_decode)
_assert_ok("single_user_decode", exc, result, engine)
def test_multi_user_prefill_through_engine():
exc, result, engine = _run_panel(_bench_fn_multi_user_prefill)
_assert_ok("multi_user_prefill", exc, result, engine)
def test_multi_user_decode_through_engine():
exc, result, engine = _run_panel(_bench_fn_multi_user_decode)
_assert_ok("multi_user_decode", exc, result, engine)
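The multi_user panels above pass `rank_axis=1` so that the ring runs across cubes. The rank selection this implies can be sketched in plain Python. This is an illustration only: `resolve_rank` and `neighbors` are hypothetical helpers, not part of the kernels, and the edge behavior (rank 0 has no W neighbor, the last rank has no E neighbor) is an assumption inferred from the rank_axis spec tests in this commit.

```python
def resolve_rank(pe_id, cube_id, rank_axis=0):
    """Return the ring rank for this program, or None if it must stay silent.

    rank_axis=0: every PE participates and rank == pe_id (single_user).
    rank_axis=1: only PE 0 of each cube participates and rank == cube_id.
    """
    if rank_axis == 0:
        return pe_id
    if rank_axis == 1:
        return cube_id if pe_id == 0 else None
    raise ValueError(f"unsupported rank_axis: {rank_axis}")


def neighbors(rank, n_ranks):
    """Assumed E/W neighbor flags for a rank on a chain of n_ranks."""
    return {"W": rank > 0, "E": rank < n_ranks - 1}


single_user_rank = resolve_rank(pe_id=3, cube_id=0)               # PE ring
silent_pe = resolve_rank(pe_id=2, cube_id=1, rank_axis=1)         # gated out
cube_leader = resolve_rank(pe_id=0, cube_id=1, rank_axis=1)       # cube ring
west_edge = neighbors(resolve_rank(0, 0, rank_axis=1), n_ranks=4)
```

With four cubes of eight PEs each, only four programs take part in the cube-level ring; the other 28 return before emitting any command.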
@@ -0,0 +1,172 @@
"""Phase 1 spec test for ``rank_axis`` parameter on the two mesh kernels.
ADR-0059's mesh kernels currently hard-code ``rank = tl.program_id(axis=0)``,
which only works for single_user_* panels (rank == pe_id within cube).
For multi_user_* panels the ring is at the cube level, so rank should be
``cube_id`` (axis=1), and the 7 non-rank-leader PEs in each cube should
not run the ring (they only hold KV replicas).
This test pins the desired ``rank_axis`` kwarg semantics:
rank_axis = 0 (default, single_user)
rank = tl.program_id(axis=0). Every PE in the cube runs the ring.
Existing behavior, no change.
rank_axis = 1 (multi_user)
if tl.program_id(axis=0) != 0: return. (7/8 PEs early-exit.)
rank = tl.program_id(axis=1).
Phase 1 expectation: tests fail today (kernels don't accept the kwarg).
Phase 2 lands the parameter on both kernels; tests turn green and the
multi_user_* diag harness clears its first send.
"""
from __future__ import annotations
from kernbench.common.ipcq_types import IpcqRecvCmd, IpcqSendCmd
from kernbench.common.pe_commands import GemmCmd
from kernbench.triton_emu.tl_context import TLContext, run_kernel
from kernbench.benches._attention_mesh_kv import attention_mesh_kv_kernel
from kernbench.benches._attention_mesh_mlo import attention_mesh_mlo_kernel
S_Q_PREFILL = 16
S_Q_DECODE = 1
S_KV_PER_RANK = 16
H_Q = 1
H_KV = 1
D_HEAD = 64
N_RANKS_MULTI = 4
PES_PER_CUBE = 8
Q_PTR = 0x10000
K_PTR = 0x20000
V_PTR = 0x30000
O_PTR = 0x40000
def _tl(pe_id: int, cube_id: int, num_pes: int, num_cubes: int) -> TLContext:
return TLContext(
pe_id=pe_id,
num_programs=num_pes,
cube_id=cube_id,
num_cubes=num_cubes,
dispatch_cycles=0,
scratch_base=0x80000,
scratch_size=1 << 20,
)
# ── Default rank_axis=0 backward-compat ──────────────────────────
def test_mlo_kernel_default_rank_axis_zero_emits_commands_on_all_pes():
"""rank_axis defaults to 0 → kernel uses pe_id as rank, runs on every
PE. Verify by running rank=3 (interior PE) in a single-cube 8-rank
setup and asserting at least one GEMM and at least one IPCQ send
are emitted (interior ranks send in both directions)."""
tl = _tl(pe_id=3, cube_id=0, num_pes=8, num_cubes=1)
run_kernel(
attention_mesh_mlo_kernel, tl,
Q_PTR, K_PTR, V_PTR, O_PTR,
S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, 8,
)
assert any(isinstance(c, GemmCmd) for c in tl.commands), \
"default rank_axis=0 must run the kernel (≥1 GEMM)"
assert any(isinstance(c, IpcqSendCmd) for c in tl.commands), \
"interior rank must emit ≥1 IpcqSendCmd"
def test_kv_kernel_default_rank_axis_zero_emits_commands_on_all_pes():
tl = _tl(pe_id=3, cube_id=0, num_pes=8, num_cubes=1)
run_kernel(
attention_mesh_kv_kernel, tl,
Q_PTR, K_PTR, V_PTR, O_PTR,
S_Q_PREFILL, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, 8,
)
assert any(isinstance(c, GemmCmd) for c in tl.commands)
assert any(isinstance(c, IpcqSendCmd) for c in tl.commands)
# ── rank_axis=1 multi_user semantics ─────────────────────────────
def test_mlo_kernel_rank_axis_one_gates_non_zero_pe_to_no_commands():
"""rank_axis=1 + pe_id != 0 → kernel must early-return; no GEMM,
no DMA, no IPCQ. The 7 non-rank-leader PEs in a multi_user cube
must stay completely silent so the cube-level SFR install isn't
asked to route sends from PEs that have no neighbors installed."""
tl = _tl(pe_id=2, cube_id=1, num_pes=PES_PER_CUBE, num_cubes=N_RANKS_MULTI)
run_kernel(
attention_mesh_mlo_kernel, tl,
Q_PTR, K_PTR, V_PTR, O_PTR,
S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, N_RANKS_MULTI,
rank_axis=1,
)
assert not any(isinstance(c, GemmCmd) for c in tl.commands), \
"pe_id=2 with rank_axis=1 must not emit GEMMs"
assert not any(isinstance(c, IpcqSendCmd) for c in tl.commands), \
"pe_id=2 with rank_axis=1 must not emit IpcqSendCmd"
assert not any(isinstance(c, IpcqRecvCmd) for c in tl.commands), \
"pe_id=2 with rank_axis=1 must not emit IpcqRecvCmd"
def test_kv_kernel_rank_axis_one_gates_non_zero_pe_to_no_commands():
tl = _tl(pe_id=2, cube_id=1, num_pes=PES_PER_CUBE, num_cubes=N_RANKS_MULTI)
run_kernel(
attention_mesh_kv_kernel, tl,
Q_PTR, K_PTR, V_PTR, O_PTR,
S_Q_PREFILL, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, N_RANKS_MULTI,
rank_axis=1,
)
assert not any(isinstance(c, GemmCmd) for c in tl.commands)
assert not any(isinstance(c, IpcqSendCmd) for c in tl.commands)
assert not any(isinstance(c, IpcqRecvCmd) for c in tl.commands)
def test_mlo_kernel_rank_axis_one_pe_zero_uses_cube_id_as_rank():
"""rank_axis=1 + pe_id == 0 → kernel runs the ring with rank=cube_id.
For cube_id=1 in a 4-cube ring, rank=1 is an interior rank: has_E=True
AND has_W=True, so IPCQ sends are emitted in both E and W directions.
"""
tl = _tl(pe_id=0, cube_id=1, num_pes=PES_PER_CUBE, num_cubes=N_RANKS_MULTI)
run_kernel(
attention_mesh_mlo_kernel, tl,
Q_PTR, K_PTR, V_PTR, O_PTR,
S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, N_RANKS_MULTI,
rank_axis=1,
)
sends = [c for c in tl.commands if isinstance(c, IpcqSendCmd)]
assert any(s.direction == "E" for s in sends), \
"cube_id=1 (interior) must emit ≥1 E-send"
assert any(s.direction == "W" for s in sends), \
"cube_id=1 (interior) must emit ≥1 W-send"
def test_kv_kernel_rank_axis_one_pe_zero_uses_cube_id_as_rank():
tl = _tl(pe_id=0, cube_id=1, num_pes=PES_PER_CUBE, num_cubes=N_RANKS_MULTI)
run_kernel(
attention_mesh_kv_kernel, tl,
Q_PTR, K_PTR, V_PTR, O_PTR,
S_Q_PREFILL, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, N_RANKS_MULTI,
rank_axis=1,
)
sends = [c for c in tl.commands if isinstance(c, IpcqSendCmd)]
assert any(s.direction == "E" for s in sends)
assert any(s.direction == "W" for s in sends)
def test_mlo_kernel_rank_axis_one_west_edge_cube_no_west_sends():
"""cube_id=0 (west edge) with rank_axis=1: rank=0, has_W=False → no
W-direction IPCQ sends. has_E=True, so at least one E-direction send."""
tl = _tl(pe_id=0, cube_id=0, num_pes=PES_PER_CUBE, num_cubes=N_RANKS_MULTI)
run_kernel(
attention_mesh_mlo_kernel, tl,
Q_PTR, K_PTR, V_PTR, O_PTR,
S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, N_RANKS_MULTI,
rank_axis=1,
)
sends = [c for c in tl.commands if isinstance(c, IpcqSendCmd)]
assert any(s.direction == "E" for s in sends), \
"west-edge cube_id=0 must still emit ≥1 E-send"
assert not any(s.direction == "W" for s in sends), \
"west-edge cube_id=0 must NOT emit any W-send (no W neighbor)"
+25 -277
@@ -1,283 +1,31 @@
"""Shared plotting plumbing for the GEMM figure tests. """Thin re-export shim for the GEMM figure tests.
Not a test module (no ``test_`` prefix -> pytest does not collect it). Not a test module (no ``test_`` prefix pytest does not collect it).
Reads the committed ``docs/diagrams/gemm_sweep.json`` (produced by the heavy The sweep + renderer logic now lives in
``scripts/gemm_sweep.py`` sim sweep) and renders matplotlib PNGs into ``kernbench.benches.milestone_1h_gemm`` (production single home, ADR-0054,
``docs/diagrams/gemm_plots/``. No simulation here -> the figure tests are fast also driven by ``scripts/gemm_sweep.py``). The figure tests import the same
and run by default; regenerating the underlying data stays a manual script. names from here; behavior is unchanged (defaults still target
``docs/diagrams/gemm_plots/``).
Chart set (mirrors the GEMM MAC slides in scripts/build_overview_slides.py):
- stage breakdown (load_ref operand staging)
- MAC utilization measured (load_ref)
- MAC utilization theoretical vs measured (load_ref)
""" """
from __future__ import annotations from __future__ import annotations
import json from kernbench.benches.milestone_1h_gemm import (
from pathlib import Path DEFAULT_PLOTS_DIR as GEMM_PLOTS_DIR,
DEFAULT_SWEEP_JSON as GEMM_SWEEP_JSON,
ROOT,
emit_all_gemm_plots,
emit_mac_utilization_measured,
emit_mac_utilization_theoretical_vs_measured,
emit_stage_breakdown,
)
ROOT = Path(__file__).resolve().parent.parent.parent __all__ = [
GEMM_SWEEP_JSON = ROOT / "docs" / "diagrams" / "gemm_sweep.json" "GEMM_PLOTS_DIR",
GEMM_PLOTS_DIR = ROOT / "docs" / "diagrams" / "gemm_plots" "GEMM_SWEEP_JSON",
"ROOT",
# Shapes excluded from the figures (mirrors build_overview_slides). "emit_all_gemm_plots",
EXCLUDED_SHAPES = {(512, 512, 512)} "emit_mac_utilization_measured",
"emit_mac_utilization_theoretical_vs_measured",
# Stage bars shown (raw op_log stage_type keys) + display names + colors. "emit_stage_breakdown",
STAGE_KEYS = ["DMA_READ", "FETCH", "GEMM", "DMA_WRITE"] ]
STAGE_DISPLAY = {
"DMA_READ": "DMA in",
"FETCH": "Fetch",
"GEMM": "GEMM",
"DMA_WRITE": "DMA out",
}
STAGE_COLORS = {
"DMA_READ": "#3B82F6",
"FETCH": "#10B981",
"GEMM": "#F59E0B",
"DMA_WRITE": "#A855F7",
}
# MAC-utilization model constants (mirror build_overview_slides).
_HBM_GBS = 256.0
_BPE = 2
_T_STAGE = 16.0
_D_STAGES = 3
_PLOT_VARIANT = "load_ref"
def _load_sweep_data() -> dict:
if not GEMM_SWEEP_JSON.exists():
return {"rows": []}
data = json.loads(GEMM_SWEEP_JSON.read_text())
data["rows"] = [
r for r in data.get("rows", [])
if (r["M"], r["K"], r["N"]) not in EXCLUDED_SHAPES
]
return data
def _shape_label(r: dict) -> str:
if r["M"] == r["K"] == r["N"]:
return f"M=K=N={r['M']}"
return f"M={r['M']} K={r['K']} N={r['N']}"
def _under_tile(M, K, N, tile_M, tile_K, tile_N) -> bool:
return M < tile_M or K < tile_K or N < tile_N
def _xtick_labels(shape_labels, tile_counts, flagged) -> list[str]:
out = []
for lbl, tc, fl in zip(shape_labels, tile_counts, flagged):
s = f"{lbl}\n({tc} tiles)"
if fl:
s += " *"
out.append(s)
return out
def _grouped_bar_png(
out_name: str, *, title: str, subtitle: str | None,
shape_labels, tile_counts, flagged, series: dict, colors: dict,
y_label: str, threshold: float | None = None, footnote: str | None = None,
) -> str:
"""Render one grouped-bar chart to GEMM_PLOTS_DIR/out_name; return the path."""
import matplotlib.pyplot as plt
import numpy as np
n_groups = len(shape_labels)
n_series = max(1, len(series))
x = np.arange(n_groups)
width = 0.8 / n_series
fig, ax = plt.subplots(figsize=(11, 6))
for i, (name, vals) in enumerate(series.items()):
offset = (i - (n_series - 1) / 2) * width
ax.bar(x + offset, vals, width, label=name, color=colors.get(name))
ax.set_xticks(x)
ax.set_xticklabels(
_xtick_labels(shape_labels, tile_counts, flagged), fontsize=8,
)
ax.set_ylabel(y_label)
ax.set_title(title, fontsize=13, fontweight="bold")
if subtitle:
ax.text(0.5, 1.01, subtitle, transform=ax.transAxes, ha="center",
va="bottom", fontsize=8, color="#475569")
if threshold is not None:
ax.axhline(threshold, ls="--", color="gray", lw=1.0)
ax.legend(fontsize=8, loc="upper right")
ax.grid(True, axis="y", alpha=0.3)
caption = "* = under-tile shape (M<TILE_M, K<TILE_K, or N<TILE_N)"
if footnote:
caption = footnote + "\n" + caption
fig.text(0.5, 0.01, caption, ha="center", fontsize=7, color="gray",
wrap=True)
fig.tight_layout(rect=(0, 0.05, 1, 1))
GEMM_PLOTS_DIR.mkdir(parents=True, exist_ok=True)
out = GEMM_PLOTS_DIR / out_name
fig.savefig(out, dpi=120)
plt.close(fig)
return str(out)
# ── individual chart renderers (read sweep JSON, emit one PNG each) ─────
def emit_stage_breakdown() -> str | None:
"""Per-stage engine wall-clock per shape (load_ref operand staging)."""
data = _load_sweep_data()
rows = [r for r in data["rows"] if r.get("variant") == _PLOT_VARIANT]
if not rows:
return None
tile = data["tile_sizes"]
shape_labels = [_shape_label(r) for r in rows]
flagged = [_under_tile(r["M"], r["K"], r["N"], tile["M"], tile["K"], tile["N"])
for r in rows]
tile_counts = [r["tile_count_expected"] for r in rows]
series = {
STAGE_DISPLAY[s]: [r.get("stages", {}).get(s, {}).get("wall_ns", 0.0)
for r in rows]
for s in STAGE_KEYS
}
colors = {STAGE_DISPLAY[s]: STAGE_COLORS[s] for s in STAGE_KEYS}
return _grouped_bar_png(
"gemm_stage_breakdown.png",
title="GEMM stage breakdown",
subtitle=(f"Per-stage engine wall-clock (DMA in / Fetch / GEMM / "
f"DMA out), {_PLOT_VARIANT} staging. "
f"Tile {tile['M']}x{tile['K']}x{tile['N']}."),
shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
series=series, colors=colors, y_label="ns",
footnote="Bars = engine wall-clock interval (merged overlaps).",
)
def emit_mac_utilization_measured() -> str | None:
"""GEMM util % and useful pipeline-eff % (analytical model, load_ref)."""
data = _load_sweep_data()
rows = data["rows"]
if not rows:
return None
tile = data["tile_sizes"]
TILE_M, TILE_K, TILE_N = tile["M"], tile["K"], tile["N"]
tile_flops = 2 * TILE_M * TILE_K * TILE_N
dma_w_per_pair = (TILE_M * TILE_N * _BPE) / _HBM_GBS
head_ns = (_D_STAGES - 1) * _T_STAGE
by_shape = {(r["M"], r["K"], r["N"]): r
for r in rows if r["variant"] == _PLOT_VARIANT}
shapes = list(by_shape)
if not shapes:
return None
shape_labels = [_shape_label(by_shape[k]) for k in shapes]
flagged = [_under_tile(*k, TILE_M, TILE_K, TILE_N) for k in shapes]
tile_counts = [by_shape[k]["tile_count_expected"] for k in shapes]
gemm_util, useful_eff = [], []
for k in shapes:
r = by_shape[k]
M, K, N = r["M"], r["K"], r["N"]
useful = 2 * M * K * N
tiles = r["tile_count_expected"]
gu = useful / (tile_flops * tiles) * 100
gemm_util.append(gu)
m_tiles = (M + TILE_M - 1) // TILE_M
n_tiles = (N + TILE_N - 1) // TILE_N
n_mn = m_tiles * n_tiles
compute_total = tiles * _T_STAGE
wall = head_ns + tiles * _T_STAGE + max(0, n_mn - 1) * dma_w_per_pair
ueff = (compute_total * (gu / 100.0) / wall) * 100 if wall > 0 else 0.0
useful_eff.append(ueff)
series = {"GEMM util %": gemm_util, "Useful eff %": useful_eff}
colors = {"GEMM util %": "#10B981", "Useful eff %": "#F59E0B"}
return _grouped_bar_png(
"gemm_mac_utilization_measured.png",
title="GEMM MAC utilization — load_ref",
subtitle=("GEMM util = useful FLOPs / (tile FLOPs x tiles); "
"Useful eff = GEMM util x ideal pipeline efficiency."),
shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
series=series, colors=colors, y_label="%", threshold=100.0,
footnote="Theoretical ideal-pipeline model (not simulator data).",
)
def emit_mac_utilization_theoretical_vs_measured() -> str | None:
"""Theoretical vs simulator-measured GEMM util / useful eff (load_ref)."""
data = _load_sweep_data()
rows = data["rows"]
if not rows:
return None
tile = data["tile_sizes"]
TILE_M, TILE_K, TILE_N = tile["M"], tile["K"], tile["N"]
tile_flops = 2 * TILE_M * TILE_K * TILE_N
dma_w_per_pair = (TILE_M * TILE_N * _BPE) / _HBM_GBS
head_ns = (_D_STAGES - 1) * _T_STAGE
peak_per_ns = tile_flops / _T_STAGE
by_shape = {(r["M"], r["K"], r["N"]): r
for r in rows if r["variant"] == _PLOT_VARIANT}
shapes = list(by_shape)
if not shapes:
return None
shape_labels = [_shape_label(by_shape[k]) for k in shapes]
flagged = [_under_tile(*k, TILE_M, TILE_K, TILE_N) for k in shapes]
tile_counts = [by_shape[k]["tile_count_expected"] for k in shapes]
gu_t, gu_m, eff_t, eff_m = [], [], [], []
for k in shapes:
r = by_shape[k]
M, K, N = r["M"], r["K"], r["N"]
useful = 2 * M * K * N
tiles = r["tile_count_expected"]
gut = useful / (tile_flops * tiles)
gu_t.append(gut * 100)
rec = r.get("stages", {}).get("GEMM", {}).get("record_count", 0) or tiles
gu_m.append((useful / (tile_flops * rec) * 100) if rec else 0.0)
m_tiles = (M + TILE_M - 1) // TILE_M
n_tiles = (N + TILE_N - 1) // TILE_N
n_mn = m_tiles * n_tiles
compute_total = tiles * _T_STAGE
wall_t = head_ns + compute_total + max(0, n_mn - 1) * dma_w_per_pair
eff_t.append((compute_total * gut / wall_t * 100) if wall_t > 0 else 0.0)
cw = r.get("composite_window_ns", 0.0) or 0.0
eff_m.append((useful / cw / peak_per_ns * 100) if cw > 0 else 0.0)
series = {
"GEMM util % (theoretical)": gu_t,
"GEMM util % (measured)": gu_m,
"Theoretical eff %": eff_t,
"Measured eff %": eff_m,
}
colors = {
"GEMM util % (theoretical)": "#10B981",
"GEMM util % (measured)": "#6EE7B7",
"Theoretical eff %": "#F59E0B",
"Measured eff %": "#3B82F6",
}
return _grouped_bar_png(
"gemm_mac_utilization_theoretical_vs_measured.png",
title="GEMM MAC utilization — theoretical vs measured (load_ref)",
subtitle=("theoretical model vs simulator op_log; agreement "
"validates the analytical pipeline model."),
shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
series=series, colors=colors, y_label="%", threshold=100.0,
)
def emit_all_gemm_plots() -> list[str]:
"""Render every GEMM figure that has data; return the list of paths written."""
paths = []
for fn in (emit_stage_breakdown,
emit_mac_utilization_measured,
emit_mac_utilization_theoretical_vs_measured):
p = fn()
if p:
paths.append(p)
return paths
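The two plotted metrics can be reproduced outside the plotting code. The following is a minimal sketch of the ideal-pipeline model used by the emitters above, not the module's own helper: the function name `ideal_pipeline_metrics` and its parameters are hypothetical, and the tile count is assumed to be `m_tiles * k_tiles * n_tiles`, whereas the source reads it from `tile_count_expected` in the sweep JSON.

```python
def ideal_pipeline_metrics(M, K, N, tile_m, tile_k, tile_n,
                           t_stage_ns, d_stages, dma_w_per_pair_ns):
    """Return (gemm_util, useful_eff) as fractions in [0, 1].

    Hypothetical helper mirroring the formulas in the emitters above.
    """
    # Ceil-divide each dimension into tiles.
    m_tiles = (M + tile_m - 1) // tile_m
    k_tiles = (K + tile_k - 1) // tile_k
    n_tiles = (N + tile_n - 1) // tile_n
    # ASSUMPTION: one GEMM tile per (m, k, n) triple; the source takes
    # this value from r["tile_count_expected"] instead.
    tiles = m_tiles * k_tiles * n_tiles

    useful = 2 * M * K * N                      # FLOPs that contribute to the result
    tile_flops = 2 * tile_m * tile_k * tile_n   # FLOPs issued per full tile
    gemm_util = useful / (tile_flops * tiles)

    head = (d_stages - 1) * t_stage_ns          # pipeline fill latency
    compute_total = tiles * t_stage_ns
    # One write-back DMA per (m, n) output tile; all but the first are
    # modeled as exposed, as in the source's wall formula.
    wall = head + compute_total + max(0, m_tiles * n_tiles - 1) * dma_w_per_pair_ns
    useful_eff = compute_total * gemm_util / wall if wall > 0 else 0.0
    return gemm_util, useful_eff


# Illustrative numbers only (not taken from the sweep data).
full_tile = ideal_pipeline_metrics(64, 64, 64, 64, 64, 64, 10.0, 4, 5.0)
half_tile = ideal_pipeline_metrics(32, 64, 64, 64, 64, 64, 10.0, 4, 5.0)
```

With a single full tile the GEMM utilization is 1.0 and the useful efficiency is bounded by the pipeline fill (`1 / d_stages`); halving M halves both numbers, which is the under-tile effect that the `flagged` markers point out.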
File diff suppressed because it is too large.
+131
@@ -0,0 +1,131 @@
"""Phase 1 spec test for ``ctx.launch`` dim-translation bugs surfaced by
the multi_user_* panels of milestone-gqa-llama70b (sub-cycle 4c step 2).
The default ``topology.yaml`` has 4×4 = 16 cubes per SIP, so
``RuntimeContext._num_cubes == 16``. Multi-user attention panels run a
4-cube ring (validation scale) by passing ``DPPolicy(num_cubes=4)``.
Two bugs in ``ctx.launch`` make this combination silently produce wrong
kernel arguments:
Bug A: ``_compute_local_shape`` ignores ``DPPolicy.num_cubes``
``_compute_local_shape`` in ``ctx.launch`` divides by
``self._num_cubes`` (the topology's cube count, 16) instead of the
DPPolicy's effective ``num_cubes`` (4). So a ``(M=80, K=64)`` tensor
sharded ``cube="row_wise"`` with ``DPPolicy(num_cubes=4)`` produces
a local M of ``80 // 16 = 5``, not the kernel-expected ``80 // 4 = 20``.
Note: tensor allocation already honors ``dp.num_cubes`` correctly at
[context.py:471-484](src/kernbench/runtime_api/context.py#L471-L484);
the bug is that the parallel computation inside ``launch`` is out of sync.
Bug B: scalar args coincidentally equal to a global tensor dim get auto-remapped
The dim_map at [context.py:712-770](src/kernbench/runtime_api/context.py#L712-L770)
is keyed by *value*, so any scalar whose value coincides with a
global tensor dim gets rewritten to that dim's local value — even
when the scalar is unrelated. ``d_head=64`` coincides with the
multi_user K's global M = ``S_kv_per_rank * n = 16 * 4 = 64``, so
the kernel receives ``d_head = 16`` (the post-Bug-A local) or
``d_head = 4`` (the pre-Bug-A local) instead of ``64``.
Legacy bench kernels rely on auto-remap (e.g. ``test_va_offset.py``
passes global N and expects the kernel to see local N). The fix is
opt-out, not removal: ``ctx.launch(..., _auto_dim_remap=False)``
preserves scalars exactly as passed, default behavior unchanged.
Both tests fail today. Phase 2 fixes them in [src/kernbench/runtime_api/context.py](src/kernbench/runtime_api/context.py).
"""
from __future__ import annotations
from pathlib import Path
from kernbench.policy.placement.dp import DPPolicy
from kernbench.runtime_api.context import RuntimeContext
from kernbench.runtime_api.types import DeviceSelector
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import load_topology
TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
def _make_ctx(corr_id: str) -> RuntimeContext:
graph = load_topology(TOPOLOGY_PATH)
engine = GraphEngine(graph)
return RuntimeContext(
engine=engine, target_device=DeviceSelector("sip:0"),
correlation_id=corr_id, spec=graph.spec,
)
def test_topology_num_cubes_is_16_baseline_assumption():
"""Sanity: confirm the topology this test assumes (16 cubes per SIP).
If this fails, recheck the topology.yaml cube_mesh setting before
interpreting the other failures below. ``_num_cubes`` is initialized
lazily by ``_ensure_allocators`` on first tensor op, so trigger it."""
ctx = _make_ctx("dim-baseline")
ctx._ensure_allocators()
assert ctx._num_cubes == 16, (
f"expected default topology.yaml to give 16 cubes per SIP, "
f"got {ctx._num_cubes}"
)
def test_ctx_launch_local_shape_honors_dppolicy_num_cubes():
"""Bug A. ``DPPolicy(num_cubes=4)`` must be the divisor for
row_wise sharding inside ctx.launch's dim_map, not the topology's 16.
Setup: K-like tensor with M_global = 80 (cleanly divisible by both
4 and 16, distinct local values 20 vs 5). Pass M_global as a kernel
scalar; the kernel records what it received. With correct dim_map,
scalar 80 is remapped to 20 (80 / dp.num_cubes). With current code,
it is remapped to 5 (80 / self._num_cubes, where self._num_cubes == 16).
"""
captured: dict[str, int] = {}
def _kernel(t, m_scalar, *, tl): # noqa: ARG001
captured["m_scalar"] = int(m_scalar)
ctx = _make_ctx("dim-bugA")
dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=4, num_pes=8)
t = ctx.zeros((80, 64), dtype="f16", dp=dp, name="t80x64")
ctx.launch("bugA_capture", _kernel, t, 80)
ctx.wait_all()
assert "m_scalar" in captured, "kernel was not invoked"
assert captured["m_scalar"] == 20, (
f"expected dim_map to divide 80 by dp.num_cubes=4 → 20; "
f"got {captured['m_scalar']} (likely divided by topology cubes=16)"
)
def test_ctx_launch_scalar_passed_through_when_auto_remap_disabled():
"""Bug B. Scalars must not be silently remapped when their value
happens to equal a tensor's global dim — at minimum the caller must
have an opt-out.
Setup: K-like tensor with M_global = 64 row_wise. Pass d_head = 64
as a scalar (semantically unrelated to K's M, but coincidentally
equal). The kernel records d_head. With ``_auto_dim_remap=False``
on ctx.launch, d_head must stay 64.
Today: ``_auto_dim_remap`` kwarg doesn't exist → TypeError. After
Phase 2: kwarg exists, defaults to True (legacy unchanged); passing
False preserves the scalar.
"""
captured: dict[str, int] = {}
def _kernel(t, d_head, *, tl): # noqa: ARG001
captured["d_head"] = int(d_head)
ctx = _make_ctx("dim-bugB")
dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=4, num_pes=8)
t = ctx.zeros((64, 64), dtype="f16", dp=dp, name="t64x64")
ctx.launch(
"bugB_capture", _kernel, t, 64,
_auto_dim_remap=False,
)
ctx.wait_all()
assert captured.get("d_head") == 64, (
f"expected d_head scalar to pass through unchanged when "
f"_auto_dim_remap=False; got {captured.get('d_head')!r}"
)
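The two bugs described in the docstring above can be illustrated without the simulator. This is a simplified sketch, not the kernbench implementation: `local_rows` and `remap_scalars` are hypothetical names, and the value-keyed `dim_map` is reduced to a plain dict from global dim to local dim.

```python
def local_rows(global_rows, num_cubes):
    """Row-wise sharding: each cube holds global_rows // num_cubes rows."""
    return global_rows // num_cubes


def remap_scalars(scalars, dim_map, auto_dim_remap=True):
    """Rewrite scalars through a value-keyed dim_map (hypothetical model).

    With auto_dim_remap=False, scalars pass through exactly as given.
    """
    if not auto_dim_remap:
        return list(scalars)
    # Keyed by *value*: any scalar equal to a global dim is rewritten,
    # whether or not it is semantically related to that dim.
    return [dim_map.get(s, s) for s in scalars]


# Bug A: dividing by the topology's 16 cubes instead of dp.num_cubes=4.
wrong_local = local_rows(80, 16)
right_local = local_rows(80, 4)

# Bug B: d_head=64 collides with a tensor whose global M is also 64.
dim_map = {64: local_rows(64, 4)}
remapped = remap_scalars([64], dim_map)
preserved = remap_scalars([64], dim_map, auto_dim_remap=False)
```

The opt-out keeps the legacy default intact, so kernels that pass a global dim and expect the local value continue to work unchanged.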
+77
@@ -0,0 +1,77 @@
"""Milestone benches: registration + figure/result generation (ADR-0054).
``milestone-1h-gemm`` / ``milestone-1h-ccl`` are eval benches: run via the
normal ``run_bench`` path, they regenerate every GEMM / allreduce figure +
CSV into ``benches/1H_milestone_output/{gemm,ccl}/``. The GEMM bench in
``MILESTONE_FAST=1`` mode just re-renders the committed sweep JSON (fast,
default-run here); the CCL bench drives both full sweeps (slow, opt-in).
"""
from __future__ import annotations
import re
from pathlib import Path
import pytest
from kernbench.benches.registry import resolve
from kernbench.runtime_api.bench_runner import run_bench
from kernbench.runtime_api.types import resolve_device
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import resolve_topology
import kernbench.benches.milestone_1h_ccl as ccl_bench
import kernbench.benches.milestone_1h_gemm as gemm_bench
_NAME_RE = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")
def _run(name: str):
topo = resolve_topology("topology.yaml")
return run_bench(
topology=topo, bench_fn=resolve(name).run, device=resolve_device(None),
engine_factory=lambda t, d: GraphEngine(
getattr(t, "topology_obj", t), enable_data=True,
),
)
def test_milestone_benches_registered():
for name in ("milestone-1h-gemm", "milestone-1h-ccl"):
spec = resolve(name)
assert spec.name == name
assert _NAME_RE.match(spec.name)
assert spec.description.strip()
@pytest.mark.skipif(
not gemm_bench.DEFAULT_SWEEP_JSON.exists(),
reason="gemm_sweep.json absent; run scripts/gemm_sweep.py first",
)
def test_milestone_gemm_fast_generates_figures(monkeypatch):
monkeypatch.setenv("MILESTONE_FAST", "1")
result = _run("milestone-1h-gemm")
assert result.completion.ok, result.completion
out = gemm_bench._OUTPUT_DIR
for png in (
"gemm_stage_breakdown.png",
"gemm_mac_utilization_measured.png",
"gemm_mac_utilization_theoretical_vs_measured.png",
):
assert (out / png).exists(), f"missing {png}"
@pytest.mark.slow
def test_milestone_ccl_generates_figures():
result = _run("milestone-1h-ccl")
assert result.completion.ok, result.completion
out = ccl_bench._OUTPUT_DIR
for artifact in (
"summary.csv",
"topology.png",
"comparison_mesh_vs_ring_vs_2DTorus_vs_theoretical_vs_fsim.png",
"AllReduce_LRAB_2Dtorus_6SiP_2x3_with_TCM_SRAM_HBM.png",
"AllReduce_LRAB_Ring1D_6SiP_1x6.png",
"AllReduce_LRAB_2Dtorus_6SiP_2x3.png",
"AllReduce_LRAB_2DMesh_6SiP_2x3.png",
):
assert (out / artifact).exists(), f"missing {artifact}"
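The registration test above checks bench names against `_NAME_RE`. A small sketch of what that pattern accepts and rejects, using the regex exactly as it appears in the test file; the helper name `is_valid_bench_name` and the rejected example names are made up for illustration.

```python
import re

# Same pattern as _NAME_RE in the test above: lowercase kebab-case,
# starting with a letter, no empty or trailing segments.
NAME_RE = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")


def is_valid_bench_name(name):
    """Hypothetical helper: True if the name is lowercase kebab-case."""
    return NAME_RE.match(name) is not None


checks = {
    "milestone-1h-gemm": is_valid_bench_name("milestone-1h-gemm"),
    "milestone-1h-ccl": is_valid_bench_name("milestone-1h-ccl"),
    "Milestone_1H": is_valid_bench_name("Milestone_1H"),      # uppercase, underscore
    "1h-gemm": is_valid_bench_name("1h-gemm"),                # leading digit
    "milestone--gemm": is_valid_bench_name("milestone--gemm"),  # empty segment
    "milestone-": is_valid_bench_name("milestone-"),          # trailing hyphen
}
```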
+218
@@ -0,0 +1,218 @@
"""Phase 1 spec test for the math-input snapshot race (IPCQ slot wrap).
Context (sub-cycle 4c.0 diagnostic):
The mesh decode kernel (_attention_mesh_mlo.py) issues many tl.recv()
calls against an IPCQ ring of ~8 slots. With n_ranks=8 and bidirectional
fan-out, each PE issues 3 recvs per step × 7 steps × 2 directions =
42 recvs per panel. The IPCQ slot index is ``my_tail % n_slots``, so
the ring wraps and a fresh recv overwrites a slot whose data a prior
math op had not yet snapshotted.
OpLogger.record_end currently snapshots math inputs by re-reading
MemoryStore at record_end time (op_log.py:97-113). When a later recv
has overwritten the input addr with a DIFFERENT-shape array between
record_start and record_end, MemoryStore.read raises
``Shape mismatch: stored (16, 64) vs requested (16, 1)`` and the
snapshot becomes None (or, in Phase 2 replay, surfaces the same
exception in DataExecutor).
Phase 1 expectation: this test currently fails. It asserts the
*desired* behavior: when the math input TensorHandle carries a
.data snapshot (captured at recv time before the slot was wrapped),
OpLogger MUST prefer that snapshot over MemoryStore.read.
After Phase 2 (snapshot propagation fix), this test passes and the
sub-cycle 4c.0 mesh decode end-to-end (test_attention_mesh_decode_diag
and test_milestone_gqa_llama70b) passes for the same reason.
See: docs/adr/ADR-0020 (two-phase execution),
docs/adr/ADR-0023 (IPCQ ring slots),
docs/adr/ADR-0027 (snapshot discipline for dma_write).
"""
from __future__ import annotations
import numpy as np
from kernbench.common.pe_commands import MathCmd, TensorHandle
from kernbench.sim_engine.memory_store import MemoryStore
from kernbench.sim_engine.op_log import OpLogger
# ── Helpers ──────────────────────────────────────────────────────
def _slot_handle(addr: int, shape: tuple[int, ...], dtype: str,
data: np.ndarray | None) -> TensorHandle:
"""Build a TensorHandle as tl.recv() would: addr=slot, .data=snapshot."""
nbytes = int(np.prod(shape)) * np.dtype(
{"f16": np.float16, "f32": np.float32}[dtype]
).itemsize
return TensorHandle(
id=f"slot_{addr:x}", addr=addr, shape=shape, dtype=dtype,
nbytes=nbytes, data=data, space="tcm",
)
def _out_handle(addr: int, shape: tuple[int, ...], dtype: str) -> TensorHandle:
nbytes = int(np.prod(shape)) * np.dtype(
{"f16": np.float16, "f32": np.float32}[dtype]
).itemsize
return TensorHandle(
id=f"out_{addr:x}", addr=addr, shape=shape, dtype=dtype,
nbytes=nbytes, data=None, space="tcm",
)
# ── Tests ─────────────────────────────────────────────────────────
def test_math_snapshot_lost_when_input_slot_overwritten_with_same_nbytes():
"""Quiet variant (fails today): if a later write at the input addr has the
SAME nbytes as the math input's expected shape, MemoryStore.read
returns the LATER data, so the snapshot is silently wrong. This is the
quiet variant of the bug; it does not raise, it just produces
incorrect numerical output in Phase 2.
This test documents that the current OpLogger behavior is wrong even
when shapes coincidentally match. The Phase 2 fix removes this
silent-corruption mode by preferring handle.data.
"""
store = MemoryStore()
slot_addr = 0x3000
# Original at recv time: filled with 7s.
original = np.full((16, 1), 7.0, dtype=np.float16)
store.write("tcm", slot_addr, original)
inp = _slot_handle(slot_addr, (16, 1), "f16", data=original.copy())
out = _out_handle(0x4000, (16, 1), "f16")
cmd = MathCmd(op="maximum", inputs=(inp,), out=out)
logger = OpLogger(memory_store=store)
logger.record_start(10.0, "sip0.cube0.pe0.pe_math", cmd)
# SIMULATE: a later recv writes a DIFFERENT array at the same slot
# (same nbytes as (16,1), so MemoryStore.read does not raise).
later = np.full((16, 1), 99.0, dtype=np.float16)
store.write("tcm", slot_addr, later)
logger.record_end(15.0, "sip0.cube0.pe0.pe_math", cmd)
snap = logger.records[0].params["input_snapshots"][0]
assert snap is not None
# Desired post-fix behavior: snapshot equals ``original``.
# Today: snapshot equals ``later`` — silent corruption.
np.testing.assert_array_equal(snap, original)
def test_math_snapshot_survives_input_slot_wrap_with_different_shape():
"""The hard-failure variant: a later recv overwrites the input slot
with a DIFFERENT-shape array (different nbytes), so MemoryStore.read
at record_end raises and the snapshot becomes None. Phase 2 replay
then surfaces this as the (16, 64) vs (16, 1) crash seen in
test_attention_mesh_decode_diag.
Desired behavior: handle.data carries the recv-time snapshot, so
OpLogger never has to look at MemoryStore for this input: no race, and
the snapshot is correct.
"""
store = MemoryStore()
slot_addr = 0x3000
# Original at recv time: an (m, ) reduction result, shape (16, 1).
original = np.full((16, 1), 7.0, dtype=np.float16)
store.write("tcm", slot_addr, original)
inp = _slot_handle(slot_addr, (16, 1), "f16", data=original.copy())
out = _out_handle(0x4000, (16, 1), "f16")
cmd = MathCmd(op="maximum", inputs=(inp,), out=out)
logger = OpLogger(memory_store=store)
logger.record_start(10.0, "sip0.cube0.pe0.pe_math", cmd)
# SIMULATE the slot-wrap race: a later recv (an o triplet, shape
# (16, 64)) writes the same TCM slot. MemoryStore.read for shape
# (16, 1) now raises ValueError("Shape mismatch ...").
overwrite = np.full((16, 64), 99.0, dtype=np.float16)
store.write("tcm", slot_addr, overwrite)
logger.record_end(15.0, "sip0.cube0.pe0.pe_math", cmd)
snap = logger.records[0].params["input_snapshots"][0]
# Today: snap is None (read raised, except branch returned None).
# Post-fix: handle.data preferred → snap is original.
assert snap is not None, (
"input snapshot was lost when the recv slot was wrapped — "
"OpLogger must prefer handle.data over MemoryStore.read for "
"math inputs whose handle carries a .data snapshot"
)
assert snap.shape == (16, 1)
np.testing.assert_array_equal(snap, original)
def test_math_snapshot_handle_data_with_multiple_inputs():
"""maximum/binary math has 2 inputs; both must use their carried
snapshots independently (e.g. m_running merged with m_from_W where
only m_from_W came from a recv slot)."""
store = MemoryStore()
# Input 0: a running m value held in PE scratch (no .data; OpLogger
# falls back to MemoryStore.read as today). Its addr is stable —
# not subject to the slot-wrap race.
scratch_addr = 0x5000
m_running = np.full((16, 1), 3.0, dtype=np.float16)
store.write("tcm", scratch_addr, m_running)
inp0 = _slot_handle(scratch_addr, (16, 1), "f16", data=None)
# Input 1: m_from_W via tl.recv — carries snapshot in .data, addr
# is the recv slot which WILL be wrapped before record_end.
slot_addr = 0x3000
m_from_W = np.full((16, 1), 7.0, dtype=np.float16)
store.write("tcm", slot_addr, m_from_W)
inp1 = _slot_handle(slot_addr, (16, 1), "f16", data=m_from_W.copy())
out = _out_handle(0x4000, (16, 1), "f16")
cmd = MathCmd(op="maximum", inputs=(inp0, inp1), out=out)
logger = OpLogger(memory_store=store)
logger.record_start(10.0, "sip0.cube0.pe0.pe_math", cmd)
# Slot 0x3000 gets wrapped by a later recv with a different shape.
overwrite = np.full((16, 64), 99.0, dtype=np.float16)
store.write("tcm", slot_addr, overwrite)
logger.record_end(15.0, "sip0.cube0.pe0.pe_math", cmd)
snaps = logger.records[0].params["input_snapshots"]
assert len(snaps) == 2
# Input 0 (no carried snapshot, addr stable): MemoryStore read still
# works. This must keep working post-fix.
assert snaps[0] is not None
np.testing.assert_array_equal(snaps[0], m_running)
# Input 1 (carried snapshot, slot wrapped): must come from .data.
assert snaps[1] is not None
assert snaps[1].shape == (16, 1)
np.testing.assert_array_equal(snaps[1], m_from_W)
def test_math_snapshot_falls_back_to_memory_store_when_handle_data_is_none():
"""Backward-compat: handles with .data=None must continue to use
MemoryStore.read as today. Most math inputs (intermediate results
from local tl.dot / tl.exp etc.) have data=None and their TCM addrs
are stable for the kernel's lifetime."""
store = MemoryStore()
addr = 0x6000
arr = np.full((8, 8), 2.0, dtype=np.float16)
store.write("tcm", addr, arr)
inp = _slot_handle(addr, (8, 8), "f16", data=None)
out = _out_handle(0x7000, (8, 8), "f16")
cmd = MathCmd(op="exp", inputs=(inp,), out=out)
logger = OpLogger(memory_store=store)
logger.record_start(10.0, "sip0.cube0.pe0.pe_math", cmd)
logger.record_end(15.0, "sip0.cube0.pe0.pe_math", cmd)
snap = logger.records[0].params["input_snapshots"][0]
assert snap is not None
np.testing.assert_array_equal(snap, arr)
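The slot-wrap race and the desired snapshot preference can be modeled in a few lines. This is a toy sketch under stated assumptions, not the `OpLogger` or `MemoryStore` code: `Handle`, `slot_index`, and `snapshot_input` are hypothetical stand-ins, the store is a plain dict keyed by address, and the ring depth of 8 is taken from the "~8 slots" figure in the docstring above.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

N_SLOTS = 8  # assumed ring depth


def slot_index(tail, n_slots=N_SLOTS):
    """Ring slot for the tail-th recv; wraps once tail >= n_slots."""
    return tail % n_slots


@dataclass
class Handle:
    """Toy stand-in for a tensor handle: address, shape, optional snapshot."""
    addr: int
    shape: tuple
    data: Optional[np.ndarray] = None


def snapshot_input(handle, store):
    """Prefer the recv-time snapshot; fall back to reading the store."""
    if handle.data is not None:
        return np.array(handle.data, copy=True)
    stored = store.get(handle.addr)
    if stored is None or stored.shape != handle.shape:
        return None  # models the shape-mismatch failure path
    return stored.copy()


store = {}
original = np.full((16, 1), 7.0, dtype=np.float16)
store[0x3000] = original
recv_handle = Handle(0x3000, (16, 1), data=original.copy())
bare_handle = Handle(0x3000, (16, 1), data=None)

# A later recv wraps onto the same slot with a different shape.
store[0x3000] = np.full((16, 64), 99.0, dtype=np.float16)

snap_with_data = snapshot_input(recv_handle, store)
snap_without_data = snapshot_input(bare_handle, store)
```

With 42 recvs per panel and 8 slots, `slot_index` revisits every slot several times, so any input that is read back from the store after the fact is exposed to this overwrite; carrying the snapshot on the handle removes the dependency on timing.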
+334
@@ -0,0 +1,334 @@
"""Generate docs/adr/INDEX.md (and docs/adr-ko/INDEX.md) from the ADR corpus.
Auto-derives a section-based index following the same classification as
the /report skill: Design Principles / High-level Architecture /
Detailed Architecture (by component) / Implementation Decisions
(by topic). Run before publishing to refresh INDEX.md.
The classification table below is the single source of truth. When a new
ADR is added under docs/adr/, append an entry to ``CLASSIFICATION``. The
script exits 1 if any ADR file is missing from the table or any title
cannot be parsed, so omissions surface in CI.
Usage:
python tools/generate_adr_index.py [--root <repo-root>] [--check]
--check : exit 1 if the generated INDEX differs from the on-disk file
(used by CI to detect un-regenerated indexes).
"""
from __future__ import annotations
import argparse
import re
import sys
from pathlib import Path
ADR_FILENAME_RE = re.compile(r"^ADR-(\d{4})-([a-z0-9_-]+)\.md$")
# Title separator may be ":" (most ADRs) or "—" (em-dash; ADR-0033 uses
# this). The verifier (tools/verify_adr_lang_pairs.py) only checks the
# number, so both styles already coexist in the corpus.
TITLE_RE = re.compile(r"^# ADR-(\d{4})\s*[:—]\s*(.+?)\s*$")
DESIGN_PRINCIPLES = "Design Principles"
HIGH_LEVEL = "High-level Architecture"
DETAILED = "Detailed Architecture"
IMPL_DECISIONS = "Implementation Decisions"
# (section, subgroup) per ADR. subgroup is used to sub-divide Detailed
# (by component, see DETAILED_COMPONENTS) and Implementation (by topic).
# Add a line here when introducing a new ADR.
CLASSIFICATION: dict[int, tuple[str, str | None]] = {
# Design Principles
13: (DESIGN_PRINCIPLES, None),
33: (DESIGN_PRINCIPLES, None),
# High-level Architecture
3: (HIGH_LEVEL, "System hierarchy (Tray / SIP / CUBE / PE)"),
7: (HIGH_LEVEL, "Runtime API ↔ sim_engine boundaries"),
16: (HIGH_LEVEL, "IOChiplet NOC and memory data path"),
17: (HIGH_LEVEL, "Cube NOC and HBM connectivity"),
# Detailed Architecture (subgroup matches DETAILED_COMPONENTS entries)
14: (DETAILED, "pe_pipeline"), # covers pe_cpu/pe_dma/pe_fetch_store/pe_gemm/pe_math/pe_scheduler
23: (DETAILED, "pe_ipcq"),
34: (DETAILED, "hbm_ctrl"),
35: (DETAILED, "m_cpu"),
36: (DETAILED, "io_cpu"),
37: (DETAILED, "forwarding"),
38: (DETAILED, "pcie_ep"),
39: (DETAILED, "pe_mmu"),
40: (DETAILED, "pe_tcm"),
41: (DETAILED, "sram"),
42: (DETAILED, "tiling"),
# Implementation Decisions
1: (IMPL_DECISIONS, "Address Scheme"),
2: (IMPL_DECISIONS, "Routing & Helper API"),
4: (IMPL_DECISIONS, "Memory Semantics & Local-HBM Bandwidth"),
5: (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
6: (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
8: (IMPL_DECISIONS, "Tensor Deployment and Allocation"),
9: (IMPL_DECISIONS, "Kernel Execution and Host-Device Messaging"),
10: (IMPL_DECISIONS, "CLI Surface and Semantics"),
11: (IMPL_DECISIONS, "Address Scheme"),
12: (IMPL_DECISIONS, "Kernel Execution and Host-Device Messaging"),
15: (IMPL_DECISIONS, "Component Port/Wire Fabric Model"),
20: (IMPL_DECISIONS, "Two-Pass Data Execution"),
22: (IMPL_DECISIONS, "2D Grid Program Identity"),
24: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
25: (IMPL_DECISIONS, "IPCQ Direction Addressing"),
26: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
27: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
32: (IMPL_DECISIONS, "Intercube All-Reduce"),
43: (IMPL_DECISIONS, "Evaluation Harnesses"),
44: (IMPL_DECISIONS, "Evaluation Harnesses"),
45: (IMPL_DECISIONS, "Bench Module Contract"),
46: (IMPL_DECISIONS, "Kernel-side tl.* API (TLContext)"),
47: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
48: (IMPL_DECISIONS, "Memory Allocator Algorithms"),
49: (IMPL_DECISIONS, "Probe Subcommand"),
50: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
51: (IMPL_DECISIONS, "Routing & Helper API"),
52: (IMPL_DECISIONS, "Sim-engine Op Log and Memory Store Schemas"),
53: (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
54: (IMPL_DECISIONS, "Evaluation Harnesses"),
}
# Canonical component order for the Detailed Architecture section.
# Each entry: (component_name, list[ADR-numbers that cover it]).
# Order matches src/kernbench/components/builtin/*.py alphabetical
# (the same order /report uses).
DETAILED_COMPONENTS: list[tuple[str, list[int]]] = [
("forwarding", [37]),
("hbm_ctrl", [34]),
("io_cpu", [36]),
("m_cpu", [35]),
("pcie_ep", [38]),
("pe_cpu", [14]),
("pe_dma", [14, 23]),
("pe_fetch_store", [14]),
("pe_gemm", [14]),
("pe_ipcq", [23]),
("pe_math", [14]),
("pe_mmu", [39]),
("pe_scheduler", [14]),
("pe_tcm", [40]),
("sram", [41]),
("tiling", [42]),
]
def _strip_bom(text: str) -> str:
"""Strip leading UTF-8 BOM if present."""
if text and ord(text[0]) == 0xFEFF:
return text[1:]
return text
def _find_adrs(adr_dir: Path) -> list[tuple[int, str, Path]]:
"""Return [(num, slug, path), ...] for ADR files in adr_dir, sorted by num."""
out: list[tuple[int, str, Path]] = []
for p in sorted(adr_dir.iterdir()):
if not p.is_file():
continue
m = ADR_FILENAME_RE.match(p.name)
if not m:
continue
out.append((int(m.group(1)), m.group(2), p))
out.sort(key=lambda t: t[0])
return out
def _extract_title(path: Path) -> str:
"""Parse the title from the first line `# ADR-NNNN: <title>`. Strips BOM."""
text = _strip_bom(path.read_text(encoding="utf-8"))
first_line = text.split("\n", 1)[0] if text else ""
m = TITLE_RE.match(first_line)
if not m:
raise ValueError(
f"{path.name}: cannot parse title from first line: {first_line!r}"
)
return m.group(2)
def _build_index(adr_dir: Path, link_prefix: str) -> str:
"""Build the INDEX.md text for adr_dir.
link_prefix is the relative href used for ADR links (e.g., ``./``
so links resolve relative to the INDEX file location).
"""
adrs = _find_adrs(adr_dir)
if not adrs:
raise RuntimeError(f"No ADR files found under {adr_dir}")
# Validate every ADR is classified.
missing = sorted(num for num, _slug, _ in adrs if num not in CLASSIFICATION)
if missing:
raise RuntimeError(
"ADR(s) missing from CLASSIFICATION table in "
"tools/generate_adr_index.py: "
+ ", ".join(f"ADR-{n:04d}" for n in missing)
+ ". Add an entry for each."
)
# Map: num → (filename, title)
num_to_meta: dict[int, tuple[str, str]] = {}
for num, _slug, path in adrs:
num_to_meta[num] = (path.name, _extract_title(path))
# ── Section assembly ────────────────────────────────────────────
lines: list[str] = []
lines.append("# ADR Index")
lines.append("")
lines.append(
f"Auto-generated by `tools/generate_adr_index.py`. "
f"Total ADRs: **{len(adrs)}**."
)
lines.append("")
lines.append(
"Classification mirrors the `/report` skill's section assignment. "
"When adding a new ADR, also add an entry to the "
"`CLASSIFICATION` table in `tools/generate_adr_index.py`."
)
lines.append("")
def fmt_entry(num: int) -> str:
fname, title = num_to_meta[num]
return f"- [ADR-{num:04d}]({link_prefix}{fname}) — {title}"
# Design Principles
lines.append("## Design Principles")
lines.append("")
nums = sorted(n for n, (sec, _) in CLASSIFICATION.items()
if sec == DESIGN_PRINCIPLES and n in num_to_meta)
for n in nums:
lines.append(fmt_entry(n))
lines.append("")
# High-level Architecture (sorted by ADR number)
lines.append("## High-level Architecture")
lines.append("")
nums = sorted(n for n, (sec, _) in CLASSIFICATION.items()
if sec == HIGH_LEVEL and n in num_to_meta)
for n in nums:
sub = CLASSIFICATION[n][1] or ""
fname, title = num_to_meta[n]
if sub:
lines.append(
f"- [ADR-{n:04d}]({link_prefix}{fname}) — {title}"
f" _({sub})_"
)
else:
lines.append(fmt_entry(n))
lines.append("")
# Detailed Architecture (canonical component order)
lines.append("## Detailed Architecture")
lines.append("")
lines.append("One subsection per component file under `src/kernbench/components/builtin/`.")
lines.append("")
for comp, adr_nums in DETAILED_COMPONENTS:
lines.append(f"### {comp}")
lines.append("")
if adr_nums:
for n in adr_nums:
if n not in num_to_meta:
raise RuntimeError(
f"DETAILED_COMPONENTS references ADR-{n:04d} for "
f"'{comp}' but no such ADR file exists."
)
lines.append(fmt_entry(n))
else:
lines.append("_(no ADR coverage)_")
lines.append("")
# Implementation Decisions: group ADRs by subgroup (topic); topics are then ordered by smallest ADR number.
lines.append("## Implementation Decisions")
lines.append("")
topic_order: list[str] = []
topic_to_nums: dict[str, list[int]] = {}
for n, (sec, sub) in CLASSIFICATION.items():
if sec != IMPL_DECISIONS or n not in num_to_meta:
continue
topic = sub or "Uncategorized"
if topic not in topic_to_nums:
topic_order.append(topic)
topic_to_nums[topic] = []
topic_to_nums[topic].append(n)
# Stable order: by smallest ADR-number in topic, so older infra appears first.
topic_order.sort(key=lambda t: min(topic_to_nums[t]))
for topic in topic_order:
lines.append(f"### {topic}")
lines.append("")
for n in sorted(topic_to_nums[topic]):
lines.append(fmt_entry(n))
lines.append("")
return "\n".join(lines).rstrip() + "\n"
def _check_or_write(path: Path, content: str, check: bool) -> bool:
"""Write content to path, or compare in --check mode. Returns True only when --check finds a diff."""
existing = path.read_text(encoding="utf-8") if path.exists() else ""
if check:
if existing != content:
print(f"[diff] {path} would change.")
return True
return False
path.write_text(content, encoding="utf-8")
if existing != content:
print(f"[wrote] {path}")
else:
print(f"[unchanged] {path}")
return False
def main(argv: list[str] | None = None) -> int:
p = argparse.ArgumentParser(description=__doc__)
p.add_argument(
"--root", type=Path, default=Path.cwd(),
help="Repository root (default: cwd)",
)
p.add_argument(
"--check", action="store_true",
help="Exit 1 if generated INDEX would differ from disk",
)
args = p.parse_args(argv)
en_dir = args.root / "docs" / "adr"
ko_dir = args.root / "docs" / "adr-ko"
if not en_dir.is_dir():
print(f"error: {en_dir} does not exist", file=sys.stderr)
return 1
any_diff = False
try:
en_index = _build_index(en_dir, link_prefix="./")
except (RuntimeError, ValueError) as e:
print(f"error (EN): {e}", file=sys.stderr)
return 1
any_diff |= _check_or_write(en_dir / "INDEX.md", en_index, args.check)
if ko_dir.is_dir():
try:
ko_index = _build_index(ko_dir, link_prefix="./")
except (RuntimeError, ValueError) as e:
print(f"error (KO): {e}", file=sys.stderr)
return 1
any_diff |= _check_or_write(ko_dir / "INDEX.md", ko_index, args.check)
if args.check and any_diff:
print(
"INDEX.md is out of date. "
"Run `python tools/generate_adr_index.py` to refresh.",
file=sys.stderr,
)
return 1
return 0
if __name__ == "__main__":
sys.exit(main())
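Two pieces of the generator are easy to exercise in isolation: title parsing and topic ordering. The sketch below reuses the title pattern from the script (with the em-dash written as the escape `\u2014`); `parse_title` and `order_topics` are hypothetical helpers, and the sample titles and topic names are invented for illustration.

```python
import re

# Same shape as TITLE_RE above: separator is ":" or an em-dash (U+2014).
TITLE_RE = re.compile(r"^# ADR-(\d{4})\s*[:\u2014]\s*(.+?)\s*$")


def parse_title(first_line):
    """Return (adr_number, title), or None if the line does not match."""
    m = TITLE_RE.match(first_line)
    if not m:
        return None
    return int(m.group(1)), m.group(2)


def order_topics(num_to_topic):
    """Group ADR numbers by topic and order topics by their smallest ADR number."""
    topic_to_nums = {}
    for num, topic in num_to_topic.items():
        topic_to_nums.setdefault(topic, []).append(num)
    ordered = sorted(topic_to_nums, key=lambda t: min(topic_to_nums[t]))
    return [(t, sorted(topic_to_nums[t])) for t in ordered]


colon_style = parse_title("# ADR-0001: Sample Title")
dash_style = parse_title("# ADR-0033 \u2014 Another Sample  ")
no_match = parse_title("ADR-0001: missing heading marker")

# Declaration order is deliberately scrambled; output order is by min ADR number.
topics = order_topics({11: "Topic A", 2: "Topic B", 1: "Topic A"})
```

Ordering topics by their smallest ADR number keeps the generated index stable when entries in the classification table are reordered, which matters because `--check` compares the output byte for byte.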