
ywkang 1f36baa898 ADR: add 0038-0042 (pcie_ep, pe_mmu, pe_tcm, sram, tiling)

Fill component-model coverage gaps surfaced by /report's G4 analysis.
Each ADR documents the component's First action, latency model, and
honest notes on dormant code or implementation asymmetries discovered
during re-evaluation against current code.

- 0038 pcie_ep: thin protocol-overhead model; ComponentBase forwarding
  worker as-is; named-node contract for router helpers
- 0039 pe_mmu: component + utility dual role; sub-page region stopgap;
  D2.1 flags pipeline path missing mmu.overhead_ns timeout (asymmetric
  with non-pipeline; not visible at default tlb_overhead_ns=0)
- 0040 pe_tcm: dual-channel BW serialization (read/write Resource cap=1);
  TcmRequest schema owned by TCM; timing-only (no data store)
- 0041 sram: terminal scratchpad model + ResponseMsg on reverse path;
  D1.1 flags _worker override as currently dormant (no Transaction
  actually targets the SRAM node today)
- 0042 tiling: pure plan-generator module, not a component; corrects
  the G4 misclassification; pins GEMM/Math stage sequences and
  epilogue scope contract

Also: /report skill G3 refinement — only flag older->newer asymmetric
cross-references; newer->older (e.g., 0034-0037 citing infrastructure
ADRs) are expected one-way and no longer reported.

Bilingual pair verifier (tools/verify_adr_lang_pairs.py) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 14:43:03 -07:00

9.3 KiB


ADR-0042: Tile Plan Generators (GEMM/Math pipeline plan builders)

Status

Accepted (2026-05-20).

This ADR states explicitly that tiling.py is a plan-generator module, not a SimPy component.

D6 (tile plan / self-routing) of ADR-0014 (PE Pipeline Execution Model) does not itself define the tile-plan generation algorithm, so this ADR fills that gap.

First action (the very first thing it does)

When generate_gemm_plan(M, K, N, tile_m, tile_k, tile_n, ..., pe_prefix, a_pinned, b_pinned, epilogue_specs) is called, the first thing it does is compute the tile counts and build the component ID strings:

M_tiles = max(1, ceil(M / tile_m))
K_tiles = max(1, ceil(K / tile_k))
N_tiles = max(1, ceil(N / tile_n))
dma_id   = f"{pe_prefix}.pe_dma"
fetch_id = f"{pe_prefix}.pe_fetch_store"
gemm_id  = f"{pe_prefix}.pe_gemm"
math_id  = f"{pe_prefix}.pe_math"

In other words, **the plan generator's first job is to compute the tile counts with a ceiling division and build the four sub-component IDs for this PE in one go**. It never touches SimPy events or environment objects: the module is purely functional.

generate_math_plan(M, N, tile_m, tile_n, ..., math_op, src_addr, dst_addr, pe_prefix) likewise begins by computing M_tiles and N_tiles and building three component IDs (dma_id, fetch_id, math_id).
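The first step described above can be sketched in isolation. This is a minimal illustration, not the actual tiling.py code: tile_counts_and_ids is a hypothetical helper, and the "pe0" prefix and the example dimensions are assumed values.

```python
from math import ceil


def tile_counts_and_ids(M, K, N, tile_m, tile_k, tile_n, pe_prefix):
    """Hypothetical helper mirroring the first step of generate_gemm_plan."""
    # Ceiling division; max(1, ...) keeps at least one tile even if a dimension is 0.
    m_tiles = max(1, ceil(M / tile_m))
    k_tiles = max(1, ceil(K / tile_k))
    n_tiles = max(1, ceil(N / tile_n))
    # The four sub-component IDs are plain strings derived from the PE prefix.
    ids = {
        name: f"{pe_prefix}.{name}"
        for name in ("pe_dma", "pe_fetch_store", "pe_gemm", "pe_math")
    }
    return (m_tiles, k_tiles, n_tiles), ids


# Assumed example: a 100x64 by 64x33 GEMM with 32x64x32 tiles on PE "pe0".
counts, ids = tile_counts_and_ids(100, 64, 33, 32, 64, 32, "pe0")
print(counts)
print(ids["pe_fetch_store"])
```

With these inputs the partial tiles along M and N each round up to a full tile, which is why the counts come from a ceiling rather than a floor.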

Context

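ADR-0014 D6 agreed only that "when PE_SCHEDULER receives a CompositeCmd, it generates a TilePlan and feeds self-routing tile tokens." In the code, however, the concrete plan-generation algorithm lives in the module src/kernbench/components/builtin/tiling.py, and this module:

- is a collection of pure functions (generate_gemm_plan, generate_math_plan), not a component.
- does not depend on the SimPy environment, queues, op_log, hooks, or the like.
- returns a PipelinePlan (a dataclass) as its result.

The earlier G4 analysis wrongly assumed that tiling.py was a component; in fact it is a plan-builder function injected into PE_SCHEDULER. This distinction needs to be pinned down in a separate ADR paired with ADR-0014 D6. Otherwise:

- It remains ambiguous whether responsibility for building the tile plan lies with PE_SCHEDULER or with a separate module.
- The rationale for decisions such as whether the GEMM and Math plan stage sequences are consistent (e.g., where FETCH/STORE sit) stays scattered.
- There is no recorded rationale for why options such as a_pinned / b_pinned / epilogue_specs branch at the plan level.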

Decision

D1. tiling is a pure plan-generator module, not a component

components/builtin/tiling.py defines no ComponentBase subclass. It exposes only two module-level functions:

generate_gemm_plan(...) -> PipelinePlan
generate_math_plan(...) -> PipelinePlan

No node named tiling exists in the topology graph. It lives in the builtin/ directory because it is a direct helper of PE_SCHEDULER (ADR-0014 D6); semantically, it is closer to an internal utility of PE_SCHEDULER.

D2. GEMM plan stage sequence: `M → N → K` order

Stage sequence for each (m, n, k) tile (default case, with no operand pinning and no epilogue applied):

[DMA_READ(A)] → [DMA_READ(B)] → FETCH → GEMM
                                ↑
                                ↓
(last k tile only)              [MATH(output_tile)]* → STORE → DMA_WRITE

A k_tile epilogue runs right after GEMM on every K tile; an output_tile epilogue runs once per (m, n), on the last K tile, immediately before STORE/DMA_WRITE. The K-loop accumulator stays in the RegFile, so no STORE/DMA_WRITE occurs between K tiles (output is produced only at last_k).
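The ordering rules in D2 can be sketched as follows. This is a simplified illustration under stated assumptions, not the real generator: stages are reduced to name strings, gemm_tile_stages and gemm_plan_sketch are hypothetical names, and pinning is ignored.

```python
from itertools import product


def gemm_tile_stages(is_last_k, k_tile_ops=(), output_tile_ops=()):
    """Stage names for one (m, n, k) tile (hypothetical, names only)."""
    stages = ["DMA_READ(A)", "DMA_READ(B)", "FETCH", "GEMM"]
    # k_tile epilogue ops follow GEMM on every K tile.
    stages += [f"MATH({op})" for op in k_tile_ops]
    if is_last_k:
        # output_tile epilogue ops run once, just before STORE/DMA_WRITE.
        stages += [f"MATH({op})" for op in output_tile_ops]
        stages += ["STORE", "DMA_WRITE"]
    return tuple(stages)


def gemm_plan_sketch(m_tiles, n_tiles, k_tiles, **ops):
    """Enumerate tiles in M -> N -> K order (K varies fastest)."""
    return [
        ((m, n, k), gemm_tile_stages(k == k_tiles - 1, **ops))
        for m, n, k in product(range(m_tiles), range(n_tiles), range(k_tiles))
    ]


plan = gemm_plan_sketch(1, 2, 2, k_tile_ops=("dequant",), output_tile_ops=("bias",))
for coords, stages in plan:
    print(coords, " -> ".join(stages))
```

Only the tiles with k equal to the last K index carry STORE and DMA_WRITE, which reflects the accumulator staying resident across the K loop.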

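D3. Operand pinning: `a_pinned` / `b_pinned`

When the caller passes a_pinned=True, the A DMA_READ is omitted for every (m, n, k) tile. Meaning: it signals to the plan generator that the caller (e.g., tl.composite) has already loaded all of A into TCM once via tl.load.

This branch is decided at plan time (it is not a runtime branch). The number of stage records in op_log therefore varies deterministically with pinning, and sweep analysis (e.g., the stage record count in gemm_sweep) sees this decision directly.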
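As a worked count, the effect of pinning on the number of stages can be computed in closed form. The sketch below assumes one op_log record per stage, no epilogue ops, and that b_pinned drops the B DMA_READ in the same way a_pinned drops the A DMA_READ; gemm_stage_count is a hypothetical helper, not part of gemm_sweep.

```python
def gemm_stage_count(m_tiles, n_tiles, k_tiles, a_pinned=False, b_pinned=False):
    """Hypothetical stage count for a GEMM plan without epilogue ops."""
    # Each K tile has FETCH + GEMM, plus one DMA_READ per unpinned operand.
    reads = (0 if a_pinned else 1) + (0 if b_pinned else 1)
    per_k_tile = reads + 2
    # Each (m, n) output tile adds STORE + DMA_WRITE on its last K tile.
    per_output_tile = 2
    return m_tiles * n_tiles * (k_tiles * per_k_tile + per_output_tile)


baseline = gemm_stage_count(2, 2, 4)
a_only = gemm_stage_count(2, 2, 4, a_pinned=True)
both = gemm_stage_count(2, 2, 4, a_pinned=True, b_pinned=True)
print(baseline, a_only, both)
```

Under these assumptions each pinned operand removes exactly one stage per (m, n, k) tile, so the difference between two sweeps is a multiple of M_tiles * N_tiles * K_tiles.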

D4. Epilogue scope: `k_tile` vs `output_tile`

epilogue_specs is an iterable of op-spec objects. Each op object is assumed to have the following attributes:

- op.kind: str. The math op name (e.g., "dequant", "bias", "relu", "scale"). It goes into the stage's params["op_kind"].
- op.scope: Scope. Either Scope.K_TILE or Scope.OUTPUT_TILE (Scope is an enum defined in kernbench.common.pe_commands).
- Additional per-op fields (e.g., bias, scale, factor). The plan generator currently does not use them; the runtime (PE_MATH) consumes them.

The plan generator splits the ops into two groups based on getattr(o, "scope", None):

- scope == Scope.K_TILE: add a MATH stage right after GEMM on every K tile.
- scope == Scope.OUTPUT_TILE: add a MATH stage right before STORE on the last K tile of each (m, n).

An op that has no scope attribute, or whose scope is neither enum member, is not included in the plan (getattr(..., None) == Scope.X is False for both). Applying the default (output_tile) is the responsibility of the caller (e.g., tl.composite); the plan generator only branches on the scope value that has already been filled in (aligned with the composite epilogue contract in ADR-0014).
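The scope-based split, including the silent drop of ops without a valid scope, can be sketched like this. Scope here is a local stand-in for the enum in kernbench.common.pe_commands (its member values are assumed), and EpilogueOp and split_epilogue are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any


class Scope(Enum):
    """Stand-in for kernbench's Scope enum; the string values are assumed."""
    K_TILE = "k_tile"
    OUTPUT_TILE = "output_tile"


@dataclass
class EpilogueOp:
    """Hypothetical op-spec carrying only the attributes the planner reads."""
    kind: str
    scope: Any = None


def split_epilogue(specs):
    """Group ops by scope; anything else matches neither filter and is dropped."""
    specs = list(specs or ())
    k_ops = [o for o in specs if getattr(o, "scope", None) == Scope.K_TILE]
    out_ops = [o for o in specs if getattr(o, "scope", None) == Scope.OUTPUT_TILE]
    return k_ops, out_ops


ops = [
    EpilogueOp("dequant", Scope.K_TILE),
    EpilogueOp("bias", Scope.OUTPUT_TILE),
    EpilogueOp("relu"),                  # scope never filled in: dropped
    EpilogueOp("scale", "output_tile"),  # plain string, not the enum: dropped
]
k_ops, out_ops = split_epilogue(ops)
print([o.kind for o in k_ops], [o.kind for o in out_ops])
```

The last two ops show why the caller has to fill in the default scope: an unset scope or a bare string compares unequal to both enum members, so the op never reaches the plan.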

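Scope is imported lazily inside the function to avoid the circular reference pe_commands ← pe_types ← tiling. This is an intentional pattern, not something to be improved (from D1's view that "tiling is a utility of PE_SCHEDULER", having no module-level, import-time dependency on pe_commands keeps the module boundary clean).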

D5. Math plan stage sequence: `M → N` order

Stage sequence for each (m, n) tile:

DMA_READ → FETCH → MATH → STORE → DMA_WRITE

Because there is no K dimension, concepts such as epilogue and accumulator residency do not apply. PE_FETCH_STORE's register-file accounting is handled the same way as for the GEMM plan.

D6. A plan is data: no SimPy dependency

PipelinePlan is a dataclass defined in pe_types.py that holds tiles: list[TilePlan]. Each TilePlan holds stages: tuple[Stage, ...]. The plan itself is a nearly immutable data structure (only Stage's params: dict is mutable) and holds no SimPy objects or events.

At runtime, PE_SCHEDULER looks at the first stage of the plan, creates a TileToken, and feeds it into the pipeline; the TileToken carries plan: TilePlan, stage_idx: int, and params: dict. Self-routing proceeds by TileToken.advance() caching the next stage's params (ADR-0014 D6).
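The self-routing idea can be sketched with plain dataclasses. The field names plan, stage_idx, params, and stages come from the text above; the component_id field on Stage, the boolean return value of advance(), and the example IDs are assumptions made for this illustration and may differ from pe_types.py.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    component_id: str  # assumed field: which component handles this stage
    params: dict = field(default_factory=dict)


@dataclass
class TilePlan:
    stages: tuple


@dataclass
class TileToken:
    plan: TilePlan
    stage_idx: int = 0
    params: dict = field(init=False)

    def __post_init__(self):
        # Cache the params of the stage the token starts at.
        self.params = self.plan.stages[self.stage_idx].params

    def advance(self):
        """Step to the next stage and cache its params; False when finished."""
        if self.stage_idx + 1 >= len(self.plan.stages):
            return False
        self.stage_idx += 1
        self.params = self.plan.stages[self.stage_idx].params
        return True


tile = TilePlan(stages=(
    Stage("pe0.pe_dma", {"op": "read"}),
    Stage("pe0.pe_fetch_store"),
    Stage("pe0.pe_math", {"op_kind": "relu"}),
))
token = TileToken(tile)
route = [tile.stages[token.stage_idx].component_id]
while token.advance():
    route.append(tile.stages[token.stage_idx].component_id)
print(route)
```

Nothing in this sketch needs a SimPy environment: the token only indexes into data that the plan generator produced earlier.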

D7. Plan generator contract: pure, deterministic, idempotent

Calling it twice with the same inputs returns the same PipelinePlan (deterministic down to the order of TilePlan.stages). This contract aligns with the "deterministic tile dispatch order" requirement of ADR-0014 D6.

No side effects (SimPy events, file I/O, global state), so it can be called in tests without an environment object (some cases in tests/test_pe_pipeline.py use it this way).

Alternatives Considered

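A1. Make tiling a component (e.g., PE_PLANNER)

Rejected. Plan generation is a decision algorithm that consumes no SimPy time. Making it a component would (a) drag in unnecessary infrastructure such as an inbox and resources, and (b) force PE_SCHEDULER to handle "receive plan" → "feed tiles" as two separate steps, creating a meaningless hop.

A2. Move plan generation into a PE_SCHEDULER class method

Rejected (for now). Keeping it as a separate module provides (1) testability and (2) extensibility: introducing another plan algorithm (e.g., a DTensor-aware plan) only requires defining an additional function. If the number of plan kinds later grows to the point where explicit dispatch is needed, a plan factory in PE_SCHEDULER will be introduced at that time through a separate ADR.

A3. Force plans to be immutable (frozen dataclass + tuple)

Partially adopted. Stage and TilePlan are dataclasses but not frozen. Reason: Stage.params: dict is filled in at plan-generation time and only read at runtime (TileToken merely caches it on advance). Full freezing offers little benefit relative to the cost of migrating dict → frozendict. However, the convention of not mutating a plan outside the plan-generation phase is maintained.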
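The trade-off in A3 can be seen in a few lines of standard-library Python. FrozenStage is a hypothetical class used only for illustration, and types.MappingProxyType is shown as one stdlib way to get a read-only view, not as something the codebase uses.

```python
from dataclasses import FrozenInstanceError, dataclass, field
from types import MappingProxyType


@dataclass(frozen=True)
class FrozenStage:
    """Hypothetical frozen variant of Stage, for illustration only."""
    name: str
    params: dict = field(default_factory=dict)


stage = FrozenStage("GEMM", {"tile_m": 32})

# frozen=True blocks rebinding a field...
rebind_blocked = False
try:
    stage.name = "MATH"
except FrozenInstanceError:
    rebind_blocked = True

# ...but the dict the field points to can still be mutated in place.
stage.params["tile_m"] = 64

# A read-only view rejects item assignment, at the cost of wrapping every dict.
view = MappingProxyType(stage.params)
view_blocked = False
try:
    view["tile_m"] = 128
except TypeError:
    view_blocked = True

print(rebind_blocked, stage.params["tile_m"], view_blocked)
```

Freezing the dataclass alone therefore does not make params immutable, which is why full immutability would require changing the dict type as well.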

Consequences

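- The fact that tiling.py is a plan-generator module rather than a component is now stated at the ADR level, which heads off future G4-style findings of the form "this component has no ADR".
- Because the GEMM plan stage sequence (D2) and the pinning/epilogue branches (D3, D4) are fixed in an ADR, the basis for interpreting stage record counts in sweep analysis (scripts/gemm_sweep.py) becomes clear.
- Thanks to the plan generator's pure contract (D7), tests can validate plans without an environment, in line with the spirit of "behavior validated by tests with meaningful input cases" in ADR-0013 (verification strategy).
- When new plan kinds such as a DTensor-aware plan or a K-major plan are added later, this ADR serves as the baseline: add only a new function and follow D1, D6, and D7.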
