kernbench2/docs/adr/ADR-0042-prog-tile-plan-generators.md
ywkang 1f36baa898 ADR: add 0038-0042 (pcie_ep, pe_mmu, pe_tcm, sram, tiling)
Fill component-model coverage gaps surfaced by /report's G4 analysis.
Each ADR documents the component's First action, latency model, and
honest notes on dormant code or implementation asymmetries discovered
during re-evaluation against current code.

- 0038 pcie_ep: thin protocol-overhead model; ComponentBase forwarding
  worker as-is; named-node contract for router helpers
- 0039 pe_mmu: component + utility dual role; sub-page region stopgap;
  D2.1 flags pipeline path missing mmu.overhead_ns timeout (asymmetric
  with non-pipeline; not visible at default tlb_overhead_ns=0)
- 0040 pe_tcm: dual-channel BW serialization (read/write Resource cap=1);
  TcmRequest schema owned by TCM; timing-only (no data store)
- 0041 sram: terminal scratchpad model + ResponseMsg on reverse path;
  D1.1 flags _worker override as currently dormant (no Transaction
  actually targets the SRAM node today)
- 0042 tiling: pure plan-generator module, not a component; corrects
  the G4 misclassification; pins GEMM/Math stage sequences and
  epilogue scope contract

Also: /report skill G3 refinement — only flag older->newer asymmetric
cross-references; newer->older (e.g., 0034-0037 citing infrastructure
ADRs) are expected one-way and no longer reported.

Bilingual pair verifier (tools/verify_adr_lang_pairs.py) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:43:03 -07:00

ADR-0042: Tile Plan Generators — GEMM/Math Pipeline Plan Builders

Status

Accepted (2026-05-20).

This ADR pins down tiling.py as a plan-generator module, not a SimPy component.

ADR-0014 (PE Pipeline Execution Model) D6 (tile plan / self-routing) does not specify the tile-plan generation algorithm itself; this ADR fills that gap.

First action

When generate_gemm_plan(M, K, N, tile_m, tile_k, tile_n, ..., pe_prefix, a_pinned, b_pinned, epilogue_specs) is called, the very first action is computing tile counts and constructing the PE-component ID strings:

M_tiles = max(1, ceil(M / tile_m))
K_tiles = max(1, ceil(K / tile_k))
N_tiles = max(1, ceil(N / tile_n))
dma_id   = f"{pe_prefix}.pe_dma"
fetch_id = f"{pe_prefix}.pe_fetch_store"
gemm_id  = f"{pe_prefix}.pe_gemm"
math_id  = f"{pe_prefix}.pe_math"

In short, the plan generator's first act is "compute ceiling tile counts and assemble the four sub-component IDs for this PE once". No SimPy event or environment is touched — this module is a pure function.

generate_math_plan(M, N, tile_m, tile_n, ..., math_op, src_addr, dst_addr, pe_prefix) likewise begins by computing M_tiles, N_tiles and assembling three component IDs (dma_id, fetch_id, math_id).
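The first action can be reproduced in isolation. The following is a minimal sketch, assuming only the formulas quoted above; `first_action` and the prefix `"pe0"` are hypothetical names chosen for illustration and are not part of tiling.py.

```python
from math import ceil


def first_action(M, K, N, tile_m, tile_k, tile_n, pe_prefix):
    """Hypothetical helper mirroring the opening lines of generate_gemm_plan."""
    # Ceiling division, clamped so that a zero-sized dimension still yields one tile.
    counts = (
        max(1, ceil(M / tile_m)),
        max(1, ceil(K / tile_k)),
        max(1, ceil(N / tile_n)),
    )
    # The four PE sub-component IDs are built once per call from the PE prefix.
    ids = {
        name: f"{pe_prefix}.{name}"
        for name in ("pe_dma", "pe_fetch_store", "pe_gemm", "pe_math")
    }
    return counts, ids


counts, ids = first_action(100, 64, 33, 32, 32, 32, "pe0")
print(counts)         # (4, 2, 2)
print(ids["pe_dma"])  # pe0.pe_dma
```

Note that no environment object appears anywhere in the sketch, which is the point of the "pure function" claim.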

Context

ADR-0014 D6 agreed that "PE_SCHEDULER, on receiving a CompositeCmd, generates a TilePlan and feeds self-routing tile tokens". But the concrete plan generation algorithm lives in src/kernbench/components/builtin/tiling.py, which:

  • Defines no component — it is a pair of pure functions (generate_gemm_plan, generate_math_plan).
  • Does not depend on the SimPy environment, queues, op_log, or hooks.
  • Returns a PipelinePlan (dataclass).

The original G4 analysis incorrectly described tiling.py as a component; it is in fact a plan-builder helper consumed by PE_SCHEDULER. Pinning this down in its own ADR (paired with ADR-0014 D6) prevents:

  • Ambiguity over whether plan generation belongs to PE_SCHEDULER or a separate module.
  • Inconsistent rationale for stage sequences (e.g., FETCH/STORE position) between GEMM and Math plans.
  • Undocumented branching rationale for a_pinned / b_pinned / epilogue_specs.

Decision

D1. tiling is a pure plan-generator module, not a component

components/builtin/tiling.py defines no ComponentBase subclass. It exports two module-level functions:

  • generate_gemm_plan(...) -> PipelinePlan
  • generate_math_plan(...) -> PipelinePlan

There is no tiling node in the topology graph. It lives in builtin/ because it is a direct helper for PE_SCHEDULER (ADR-0014 D6) and is conceptually a PE_SCHEDULER internal utility.

D2. GEMM plan stage sequence — M → N → K order

For each (m, n, k) tile (default — no operand pinning, no epilogue):

every k tile:       [DMA_READ(A)] → [DMA_READ(B)] → FETCH → GEMM → [MATH(k_tile)]*
last k tile only:   ... → [MATH(output_tile)]* → STORE → DMA_WRITE

A k_tile epilogue inserts a MATH stage immediately after GEMM on every K-tile; an output_tile epilogue inserts MATH stages once per (m, n), after the final K-tile's GEMM but before STORE/DMA_WRITE. The K-loop accumulator stays in the register file across K-tiles, so STORE/DMA_WRITE is emitted only on the last K-tile.
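The ordering rules above can be sketched as a small generator. This is an illustration only: it emits stage names as strings rather than real Stage objects, and `gemm_stage_names` and its parameters are hypothetical, not the signature of generate_gemm_plan.

```python
from itertools import product


def gemm_stage_names(M_tiles, N_tiles, K_tiles, k_tile_ops=(), output_tile_ops=()):
    """Yield ((m, n, k), [stage names]) in M -> N -> K order (illustrative only)."""
    # product() varies the last range fastest, so k is the innermost loop.
    for m, n, k in product(range(M_tiles), range(N_tiles), range(K_tiles)):
        stages = ["DMA_READ(A)", "DMA_READ(B)", "FETCH", "GEMM"]
        # k_tile epilogue ops follow GEMM on every K-tile.
        stages += [f"MATH({op})" for op in k_tile_ops]
        if k == K_tiles - 1:
            # output_tile epilogue ops run once per (m, n), before write-back.
            stages += [f"MATH({op})" for op in output_tile_ops]
            stages += ["STORE", "DMA_WRITE"]
        yield (m, n, k), stages


plan = list(gemm_stage_names(1, 1, 2, k_tile_ops=["dequant"], output_tile_ops=["bias"]))
for tile, stages in plan:
    print(tile, " -> ".join(stages))
```

With two K-tiles, only the second tile's stage list ends in STORE and DMA_WRITE; the first stops after the k_tile MATH stage.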

D3. Operand pinning — a_pinned / b_pinned

If a caller passes a_pinned=True, the A DMA_READ is omitted from every (m, n, k) tile; b_pinned=True does the same for the B DMA_READ. Semantically: the caller (e.g., tl.composite) has already staged the whole operand in TCM via a prior tl.load, and signals so to the plan generator.

The branch is made at plan time (not at runtime). Therefore the stage record count in op_log changes deterministically with pinning, and sweep analyses (e.g., gemm_sweep's stage record count) see this decision directly.
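Because the branch is taken at plan time, the number of stages in a plan is a closed-form function of the tile counts and the pinning flags. The sketch below works through that arithmetic for the no-epilogue case. It assumes that b_pinned removes the B read in the same way a_pinned removes the A read, and that op_log holds one record per stage; `gemm_stage_count` is a hypothetical helper, not part of tiling.py.

```python
def gemm_stage_count(M_tiles, N_tiles, K_tiles, a_pinned=False, b_pinned=False):
    """Count plan stages for a GEMM with no epilogue (illustrative arithmetic)."""
    # Each unpinned operand costs one DMA_READ per (m, n, k) tile.
    reads = (0 if a_pinned else 1) + (0 if b_pinned else 1)
    per_k_tile = reads + 2   # plus FETCH and GEMM
    per_output_tile = 2      # STORE and DMA_WRITE, last K-tile only
    return (M_tiles * N_tiles * K_tiles * per_k_tile
            + M_tiles * N_tiles * per_output_tile)


print(gemm_stage_count(2, 2, 4))                 # 72
print(gemm_stage_count(2, 2, 4, a_pinned=True))  # 56
```

Pinning A in this 2 x 2 x 4 example removes exactly one stage per (m, n, k) tile, i.e. 16 stages.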

D4. Epilogue scope — k_tile vs output_tile

epilogue_specs is an iterable of op-spec objects. Each op object is expected to have:

  • op.kind: str — math op name (e.g., "dequant", "bias", "relu", "scale"). Placed into the stage's params["op_kind"].
  • op.scope: Scope, either Scope.K_TILE or Scope.OUTPUT_TILE (Scope enum in kernbench.common.pe_commands).
  • Op-specific extras (e.g., bias, scale, factor) — currently not used by the plan generator; consumed at runtime by PE_MATH.

The plan generator partitions by getattr(o, "scope", None):

  • scope == Scope.K_TILE: adds a MATH stage right after GEMM on every K-tile.
  • scope == Scope.OUTPUT_TILE: adds MATH stages just before STORE on the last K-tile per (m, n).

Ops with neither scope value (e.g., a missing attribute) are dropped silently: getattr(..., None) == Scope.X is False for both. Picking a default (output_tile) is the caller's responsibility (e.g., tl.composite), not the plan generator's. This aligns with ADR-0014's composite epilogue contract.
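The partitioning rule, including the silent drop, can be sketched as follows. The `Scope` enum here is a local stand-in for the one in kernbench.common.pe_commands (its member values are assumed), and `partition_epilogue` is a hypothetical name for logic that the ADR describes as living inside the plan generator.

```python
from enum import Enum
from types import SimpleNamespace


class Scope(Enum):
    """Local stand-in for kernbench.common.pe_commands.Scope (values assumed)."""
    K_TILE = "k_tile"
    OUTPUT_TILE = "output_tile"


def partition_epilogue(epilogue_specs):
    """Split op specs by scope; ops matching neither scope fall through."""
    specs = list(epilogue_specs or ())
    k_tile = [o for o in specs if getattr(o, "scope", None) == Scope.K_TILE]
    output_tile = [o for o in specs if getattr(o, "scope", None) == Scope.OUTPUT_TILE]
    return k_tile, output_tile


ops = [
    SimpleNamespace(kind="dequant", scope=Scope.K_TILE),
    SimpleNamespace(kind="bias", scope=Scope.OUTPUT_TILE),
    SimpleNamespace(kind="relu"),  # no scope attribute: appears in neither list
]
k_ops, out_ops = partition_epilogue(ops)
print([o.kind for o in k_ops], [o.kind for o in out_ops])  # ['dequant'] ['bias']
```

The "relu" op vanishes without an error, which is why the caller is expected to assign a default scope before calling the generator.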

Scope is imported lazily inside the function to avoid the circular path pe_commands ← pe_types ← tiling. This is intentional and not a refactor target: keeping tiling free of import-time pe_commands dependencies preserves the module boundary (D1).

D5. Math plan stage sequence — M → N order

For each (m, n) tile:

DMA_READ → FETCH → MATH → STORE → DMA_WRITE

There is no K dimension, so concepts like epilogue or accumulator residency do not apply. PE_FETCH_STORE's register-file accounting follows the same pattern as the GEMM plan.
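For comparison with the GEMM plan, the Math plan's shape can be sketched in a few lines. As before, stage names are plain strings and `math_stage_names` is a hypothetical helper; the real generate_math_plan takes further arguments (math_op, src_addr, dst_addr, pe_prefix) that this sketch leaves out.

```python
from math import ceil

MATH_STAGES = ("DMA_READ", "FETCH", "MATH", "STORE", "DMA_WRITE")


def math_stage_names(M, N, tile_m, tile_n):
    """Return [((m, n), stages)] in M -> N order (illustrative only)."""
    M_tiles = max(1, ceil(M / tile_m))
    N_tiles = max(1, ceil(N / tile_n))
    # Every tile gets the same five-stage sequence; n is the inner loop.
    return [((m, n), MATH_STAGES) for m in range(M_tiles) for n in range(N_tiles)]


tiles = math_stage_names(64, 96, 32, 32)
print(len(tiles))               # 6
print([t for t, _ in tiles][:4])  # [(0, 0), (0, 1), (0, 2), (1, 0)]
```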

D6. Plans are data — no SimPy dependency

PipelinePlan is a dataclass in pe_types.py holding tiles: list[TilePlan]. Each TilePlan holds stages: tuple[Stage, ...]. The plan itself is near-immutable (only Stage.params: dict is mutable) and holds no SimPy objects.

At runtime, PE_SCHEDULER consumes the plan's first stage, builds a TileToken, and feeds it into the pipeline. The TileToken carries plan: TilePlan, stage_idx: int, and a cached params: dict. Self-routing proceeds by TileToken.advance() caching the next stage's params (ADR-0014 D6).
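The data shapes and the self-routing step can be sketched with plain dataclasses. Only the fields named above (tiles, stages, params, plan, stage_idx) come from this ADR; `Stage.component_id`, the constructor defaults, and the boolean return value of `advance()` are assumptions made for illustration and may differ from pe_types.py.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    component_id: str                           # assumed field name
    params: dict = field(default_factory=dict)  # the only mutable part


@dataclass
class TilePlan:
    stages: tuple[Stage, ...]


@dataclass
class PipelinePlan:
    tiles: list[TilePlan]


@dataclass
class TileToken:
    plan: TilePlan
    stage_idx: int = 0
    params: dict = field(default_factory=dict)

    def __post_init__(self):
        # Cache the params of the stage the token starts on.
        self.params = self.plan.stages[self.stage_idx].params

    def advance(self) -> bool:
        """Step to the next stage and cache its params; False at the end (assumed)."""
        if self.stage_idx + 1 >= len(self.plan.stages):
            return False
        self.stage_idx += 1
        self.params = self.plan.stages[self.stage_idx].params
        return True


tile = TilePlan(stages=(Stage("pe0.pe_dma", {"addr": 0}), Stage("pe0.pe_gemm", {"k": 0})))
token = TileToken(plan=tile)
print(token.params)     # {'addr': 0}
print(token.advance())  # True
print(token.params)     # {'k': 0}
```

Nothing in these classes references a simulation environment, so a plan can be built, inspected, and compared in an ordinary unit test.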

D7. Plan generator contract — pure, deterministic, idempotent

Two calls with identical inputs return PipelinePlan instances that compare equal (including TilePlan.stages order). This contract aligns with ADR-0014 D6's "deterministic tile dispatch order".

No side effects (no SimPy events, no file I/O, no global state) — tests can call the generators directly without an environment object (some cases in tests/test_pe_pipeline.py rely on this).
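A determinism check of this kind needs nothing beyond dataclass equality. The sketch below uses a toy generator (`toy_plan`, hypothetical) in place of the real ones; an actual test would call generate_gemm_plan or generate_math_plan twice with the same arguments and compare the results the same way.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    name: str
    params: dict = field(default_factory=dict)


def toy_plan(n_tiles):
    """Toy stand-in for a plan generator: output depends only on the input."""
    return [
        (Stage("FETCH", {"tile": t}), Stage("MATH", {"tile": t, "op_kind": "relu"}))
        for t in range(n_tiles)
    ]


first, second = toy_plan(3), toy_plan(3)
print(first == second)  # True: equal field by field, in the same order
print(first is second)  # False: each call builds a fresh object
```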

Alternatives Considered

A1. Make tiling a component (e.g., PE_PLANNER)

Rejected. Plan generation consumes no SimPy time — it is a pure decision algorithm. Making it a component would (a) add unnecessary infrastructure (inbox, resources), and (b) split PE_SCHEDULER's flow into "receive plan" plus "feed tiles", inserting a meaningless hop.

A2. Move plan generation into PE_SCHEDULER as methods

Rejected (currently). Module separation provides (1) testability and (2) extensibility for additional plan algorithms (e.g., DTensor-aware) — add a new function. If plan kinds proliferate enough to require explicit dispatch, a future ADR can introduce a plan factory on PE_SCHEDULER.

A3. Make plans fully immutable (frozen dataclass + tuple)

Partially adopted. Stage and TilePlan are dataclasses but not frozen, because Stage.params: dict is populated at plan-generation time and read at runtime (cached by TileToken on advance). Moving dict → frozendict pays migration cost without enough benefit. Convention: do not mutate after generation.
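The trade-off can be seen with the standard library alone. In the sketch below, `FrozenStage` is a hypothetical class, not a proposal for pe_types.py: freezing a dataclass blocks attribute rebinding, but a dict field remains mutable unless it is also wrapped in a read-only view such as types.MappingProxyType.

```python
from dataclasses import FrozenInstanceError, dataclass, field
from types import MappingProxyType


@dataclass(frozen=True)
class FrozenStage:
    name: str
    params: dict = field(default_factory=dict)


stage = FrozenStage("MATH", {"op_kind": "bias"})

try:
    stage.name = "GEMM"  # rebinding a field on a frozen dataclass fails
    rebind_blocked = False
except FrozenInstanceError:
    rebind_blocked = True

stage.params["op_kind"] = "relu"  # still allowed: frozen is shallow

view = MappingProxyType(stage.params)  # read-only view over the same dict
try:
    view["op_kind"] = "scale"
    view_blocked = False
except TypeError:
    view_blocked = True

print(rebind_blocked, stage.params["op_kind"], view_blocked)  # True relu True
```

So a frozen Stage alone would not deliver full immutability; the params mapping would need its own wrapper, which is the migration cost the decision refers to.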

Consequences

  • tiling.py is documented as a plan-generator module, not a component — preempting future G4-style "this component lacks an ADR" analyses.
  • The GEMM plan's stage sequence (D2) and pinning / epilogue branching (D3 / D4) are pinned, providing a clear interpretation basis for sweep analyses (e.g., scripts/gemm_sweep.py's stage record counts).
  • The plan generator's pure contract (D7) enables environment-free testing in line with ADR-0013 (verification strategy).
  • Future plan kinds (DTensor-aware, K-major, ...) follow D1 / D6 / D7 as a baseline — just add a new function.