ADR: add 0038-0042 (pcie_ep, pe_mmu, pe_tcm, sram, tiling)
Fill component-model coverage gaps surfaced by /report's G4 analysis. Each ADR
documents the component's First action, latency model, and honest notes on
dormant code or implementation asymmetries discovered during re-evaluation
against current code.

- 0038 pcie_ep: thin protocol-overhead model; ComponentBase forwarding worker
  as-is; named-node contract for router helpers
- 0039 pe_mmu: component + utility dual role; sub-page region stopgap; D2.1
  flags pipeline path missing mmu.overhead_ns timeout (asymmetric with
  non-pipeline; not visible at default tlb_overhead_ns=0)
- 0040 pe_tcm: dual-channel BW serialization (read/write Resource cap=1);
  TcmRequest schema owned by TCM; timing-only (no data store)
- 0041 sram: terminal scratchpad model + ResponseMsg on reverse path; D1.1
  flags _worker override as currently dormant (no Transaction actually targets
  the SRAM node today)
- 0042 tiling: pure plan-generator module, not a component; corrects the G4
  misclassification; pins GEMM/Math stage sequences and epilogue scope contract

Also: /report skill G3 refinement: only flag older->newer asymmetric
cross-references; newer->older (e.g., 0034-0037 citing infrastructure ADRs)
are expected one-way and no longer reported.

Bilingual pair verifier (tools/verify_adr_lang_pairs.py) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,139 @@
|
||||
# ADR-0038: PCIE_EP Component Model

## Status

Accepted (2026-05-20).

Companion to ADR-0035 (M_CPU), ADR-0036 (IO_CPU), and
ADR-0037 (Forwarding) at the same component-model level.

## First action

Pull one Transaction from `_inbox` and let `_forward_txn` invoke `run()`, which
applies a single `env.timeout(node.attrs["overhead_ns"])` for PCIe protocol
handling. After that the standard `ComponentBase` worker rules take over: if
`next_hop` exists, put the advanced Transaction on `out_ports[next_hop]`;
otherwise consume `drain_ns` and call `txn.done.succeed()`.

In other words, **PCIE_EP's first (and only) act is to spend the configured
overhead as simulator time**: no routing decisions, no payload transformation,
no MMIO decoding.
|
||||
|
||||
## Context
|
||||
|
||||
PCIE_EP is the **host ↔ device boundary** in the topology graph. The builder
|
||||
(`topology/builder.py`) creates an IO chiplet instance per SIP that contains
|
||||
`pcie_ep`, `io_cpu`, and `io_noc`, and lays bidirectional edges between the
|
||||
external `fabric.switch0` and each `pcie_ep`:
|
||||
|
||||
- `switch → pcie_ep`: host → device traffic (MemoryWrite, MemoryRead,
|
||||
KernelLaunch).
|
||||
- `pcie_ep → switch`: device-side outbound (e.g., cross-SIP IPCQ tokens).
|
||||
|
||||
Inside the IO chiplet there are bidirectional `pcie_ep ↔ io_noc` edges, and
|
||||
from there traffic branches to `io_cpu` or to the cube-side `hbm_ctrl` path
|
||||
(see ADR-0036 IO_CPU model). The router and resolver already know — per SPEC
|
||||
R7 — that PCIE_EP is the endpoint for memory operations, so helpers like
|
||||
`find_pcie_ep(sip)` and `find_memory_path(pcie_ep, dst_node)` treat PCIE_EP as
|
||||
the start (or end) of the memory path.
|
||||
|
||||
The problem is that all of this dependency lives in builder/router/resolver,
|
||||
while **PCIE_EP's own internal model has no ADR**. The consequence:
|
||||
|
||||
- "What latency does PCIE_EP model?" requires reading the source.
|
||||
- The asymmetry with peer components (IO_CPU = ADR-0036, M_CPU = ADR-0035) is
|
||||
awkward.
|
||||
- Future decisions about a more detailed PCIe link-layer model (TLP credits,
|
||||
retry, MPS chunking) lack a documented baseline.
|
||||
|
||||
This ADR pins down the current **thin PCIE_EP model** and records that this
|
||||
thinness is intentional (aligned with ADR-0033's latency-model simplification
|
||||
policy).
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. PCIE_EP uses ComponentBase's generic forwarding worker as-is

`PcieEpComponent` extends `ComponentBase` and does **not** override `_worker` or
`_forward_txn`. Every Transaction flows through the standard sequence:

1. `_fan_in` accumulates inbound messages (and reassembles Flits, per ADR-0033
   Phase 2c) into `_inbox`.
2. `_worker` pulls one message off `_inbox` and spawns
   `env.process(self._forward_txn(env, txn))` for per-message pipelining.
3. `_forward_txn` calls the op_log start hook → `run()` for latency → op_log
   end hook.
4. `run()` is a single line: `yield env.timeout(overhead_ns)`.
5. If a next hop exists, `out_ports[next_hop].put(txn.advance())`. Otherwise
   (terminal arrival) consume `drain_ns` and call `txn.done.succeed()`.
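
The time accounting in steps 4 and 5 can be sketched without SimPy. This is a minimal illustration, not the `ComponentBase` code: the `Txn` class, its fields, and the `forward` helper are hypothetical stand-ins, and simulated time is returned as a plain number instead of being spent through `env.timeout`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Txn:
    """Hypothetical stand-in for the simulator's Transaction."""
    path: list[str]
    step: int = 0
    drain_ns: float = 0.0
    done: bool = False

    @property
    def next_hop(self) -> str | None:
        nxt = self.step + 1
        return self.path[nxt] if nxt < len(self.path) else None

    def advance(self) -> "Txn":
        return Txn(self.path, self.step + 1, self.drain_ns, self.done)


def forward(txn: Txn, overhead_ns: float) -> tuple[float, str | None, Txn]:
    """Return (ns spent at this node, out port used or None, resulting txn)."""
    elapsed = overhead_ns              # step 4: run() spends overhead_ns
    hop = txn.next_hop
    if hop is not None:                # step 5, forwarding case
        return elapsed, hop, txn.advance()
    elapsed += txn.drain_ns            # step 5, terminal case
    txn.done = True                    # stands in for txn.done.succeed()
    return elapsed, None, txn


path = ["fabric.switch0", "sip0.io0.pcie_ep", "sip0.io0.io_noc"]
mid_hop = forward(Txn(path, step=1), overhead_ns=5.0)
terminal = forward(Txn(path, step=2, drain_ns=3.0), overhead_ns=5.0)
```

In the forwarding case only `overhead_ns` is spent at the node; `drain_ns` is added only on terminal arrival.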
|
||||
|
||||
### D2. The only timing parameter is `overhead_ns`
|
||||
|
||||
Only `node.attrs["overhead_ns"]` is accepted as a latency parameter. The code
|
||||
default is `0.0`; `topology.yaml`'s IOChiplet `components.pcie_ep.attrs`
|
||||
supplies the real value (current topology: `overhead_ns: 5.0` ns).
|
||||
|
||||
No separate BW-serialization resource (`simpy.Resource`), no queue depth, no
|
||||
retry model is introduced. Link-level BW serialization is handled wire-side —
|
||||
inside the IOChiplet by `pcie_ep_to_noc_bw_gbs = 256.0 GB/s`, and externally by
|
||||
the system's `io_ep_to_switch` link BW (ADR-0015 port/wire model). PCIE_EP
|
||||
itself takes no part in that accounting.
|
||||
|
||||
### D3. PCIE_EP is direction-aware in topology but direction-blind in code
|
||||
|
||||
The builder lays both `switch ↔ pcie_ep` and `pcie_ep ↔ io_noc` edges, so
|
||||
PCIE_EP serves:
|
||||
|
||||
- inbound (host → device): forward Transactions arriving from the switch onto
|
||||
io_noc-side next-hop.
|
||||
- outbound (device → host): forward Transactions arriving from io_noc/io_cpu
|
||||
back to the switch.
|
||||
|
||||
Both are handled by D1's generic forwarding worker; the component code never
|
||||
distinguishes direction (it just follows `txn.next_hop`).
|
||||
|
||||
### D4. PCIE_EP is not Flit-aware (legacy reassembly path)
|
||||
|
||||
`_FLIT_AWARE` is left at the inherited `False`, so `_fan_in` reassembles
|
||||
upstream-chunkified Flits into the parent Transaction before delivery to
|
||||
`_inbox` (aligned with ADR-0033 Phase 2c incremental rollout).
|
||||
|
||||
A future PCIe TLP-level credit model would revisit D4.
|
||||
|
||||
### D5. PCIE_EP is a **named node** for routing helpers

`policy/routing/router.py` provides `find_pcie_ep(sip, io_id="io0")`,
`find_all_pcie_eps()`, and `find_memory_path(pcie_ep, dst_node)`, all of
which treat PCIE_EP as the start (or end) of the memory path. The component
itself supplies no information to these helpers; the naming convention
(`sip{S}.{io_id}.pcie_ep`) is guaranteed by the topology builder.
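
The naming convention can be illustrated with a small formatter/parser pair. This is only a sketch of the `sip{S}.{io_id}.pcie_ep` pattern: `pcie_ep_node_id` and `parse_pcie_ep_node_id` are hypothetical names, not the implementation of `find_pcie_ep`, and the set of characters allowed in `io_id` is an assumption.

```python
import re

# Assumed shape of the node ID; io_id is taken to be alphanumeric/underscore.
_PCIE_EP_RE = re.compile(r"^sip(\d+)\.([A-Za-z0-9_]+)\.pcie_ep$")


def pcie_ep_node_id(sip: int, io_id: str = "io0") -> str:
    """Build the node ID the topology builder is expected to emit."""
    return f"sip{sip}.{io_id}.pcie_ep"


def parse_pcie_ep_node_id(node_id: str) -> tuple:
    """Split a PCIE_EP node ID back into (sip index, io_id)."""
    match = _PCIE_EP_RE.match(node_id)
    if match is None:
        raise ValueError(f"not a PCIE_EP node ID: {node_id!r}")
    return int(match.group(1)), match.group(2)


node = pcie_ep_node_id(0)
parsed = parse_pcie_ep_node_id("sip3.io1.pcie_ep")
```

A renaming of component IDs would have to change exactly this pattern, which is the dependency D5 makes explicit.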
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Full PCIe TLP-level model (credits, retry, MPS chunking)
|
||||
|
||||
Rejected. Violates ADR-0033's "current latency model = abstract overhead + BW
|
||||
serialization" simplification. Host↔device protocol fidelity is explicitly
|
||||
out-of-scope in SPEC §5 "Non-Goals".
|
||||
|
||||
### A2. Per-PCIE_EP `simpy.Resource` for in-flight cap
|
||||
|
||||
Rejected. Host traffic is not a contention bottleneck in current workloads.
|
||||
Defer to a separate ADR if it becomes one (in which case D1 stays and D2 is
|
||||
extended).
|
||||
|
||||
### A3. Merge PCIE_EP into IO_CPU
|
||||
|
||||
Rejected. PCIE_EP is the protocol-boundary node first hit on the host side;
|
||||
IO_CPU is the device-side control-plane processing node (ADR-0036). Traffic
|
||||
fan-out and command decoding costs concentrate in IO_CPU, while PCIE_EP only
|
||||
expresses link-edge overhead. Merging them would mix two responsibilities and
|
||||
violate the spirit of ADR-0007 (runtime API/sim_engine boundaries).
|
||||
|
||||
## Consequences

- PCIE_EP gets an explicit model ADR despite having near-zero code. This keeps
  it consistent with peer component ADRs and lowers maintenance friction.
- Any future PCIe-level refinement should supersede this ADR by extending
  D2/D4 in a new ADR.
- D5 makes the named-node dependency explicit, so any future renaming of
  component IDs has a clearly bounded blast radius.
|
||||
@@ -0,0 +1,203 @@
|
||||
# ADR-0039: PE_MMU Component Model — Component + Utility Dual Role
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-20).
|
||||
|
||||
ADR-0011 (PA/VA/LA address model) only states that "the VA model translates
|
||||
VA→PA via PE_MMU"; this ADR pins down **the PE_MMU component's own behavior
|
||||
model**.
|
||||
|
||||
## First action
|
||||
|
||||
At construction, read `node.attrs["page_size"]` (default `2 MiB`) and
|
||||
`node.attrs["tlb_overhead_ns"]` (default `0.0`) and instantiate the internal
|
||||
`PeMMU` utility object (`policy.address.pe_mmu.PeMMU`) exactly once. That
|
||||
object is the single owner of the page table, the sub-page region lists, and
|
||||
the TLB overhead value.
|
||||
|
||||
At runtime the first action splits into two paths:
|
||||
|
||||
- **Component path (inbox consumption)**: `_worker` pulls a Transaction off
|
||||
`_inbox`; if `request` is a `MmuMapMsg`, call `self._mmu.map(va, pa, size)`
|
||||
for each entry and then `txn.done.succeed()`. For `MmuUnmapMsg`, call
|
||||
`unmap(va, size)`. Any other type falls through to standard `_forward_txn`.
|
||||
In other words, **the component's first act is "apply map/unmap commands to
|
||||
the page table"**.
|
||||
- **Utility path (direct call)**: a sibling PE engine (PE_DMA / PE_GEMM) calls
|
||||
`pe_mmu.mmu.translate(va)` directly. This path produces no SimPy events;
|
||||
the caller (when `overhead_ns > 0`) issues a `yield env.timeout(mmu.overhead_ns)`
|
||||
in its own process.
|
||||
|
||||
## Context
|
||||
|
||||
ADR-0011 defined three address models (PA/VA/LA) and agreed that "VA model =
|
||||
translation via PE_MMU". But in code, `PeMmuComponent` performs two
|
||||
complementary roles simultaneously:
|
||||
|
||||
1. **A topology-graph component**: it receives `MmuMapMsg` / `MmuUnmapMsg`
|
||||
sideband messages over the cube NoC and updates the page table.
|
||||
2. **A PE-local utility**: PE_DMA / PE_GEMM on the same PE call
|
||||
`translate(va)` directly with zero SimPy latency (the caller pays
|
||||
`overhead_ns` if any).
|
||||
|
||||
Without an ADR covering both roles, the following questions are ambiguous:
|
||||
|
||||
- "Why isn't there a SimPy event for the MMU translate?" (Answer: the caller
|
||||
pays it.)
|
||||
- What is the sub-page region model, and why? (The code docstring has it, but
|
||||
no ADR — only a memory note `project_mmu_subpage_stopgap`.)
|
||||
- Who sends map/unmap, and when must they be visible? (Ordering contract.)
|
||||
|
||||
Additionally, `PeMMU.map()` has "append, last-write-wins on overlap"
semantics, which cannot be expressed with a one-PA-per-entry page table.
This is a deliberate **simulator stopgap** that supports DPPolicy sub-page
sharding (e.g., 128 B payloads against 4 KiB pages): several shards can map
disjoint ranges of the same page, instead of each `map()` silently replacing
the page's single entry and misrouting the earlier shards. This deviation from
real HW MMU semantics must be ADR-pinned.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. Explicit dual role — component and utility
|
||||
|
||||
`PeMmuComponent` exposes two interfaces from a single class:
|
||||
|
||||
- Component interface: `_inbox` consumption, `_worker` loop (handles MMU
|
||||
sideband messages).
|
||||
- Utility interface: the `mmu` property exposes the underlying `PeMMU` object,
|
||||
which PE_DMA / PE_GEMM hold directly and invoke `translate()` on.
|
||||
|
||||
The latter is **not a layer skip**: inside a PE, the engines and PE_MMU are
|
||||
siblings under the "components" layer (ADR-0007). Cross-layer violations only
|
||||
apply to runtime API ↔ sim_engine ↔ components boundaries.
|
||||
|
||||
### D2. Latency model — `translate()` is pure; caller owns the timeout
|
||||
|
||||
`PeMMU.translate()` is a pure function and yields nothing in SimPy. The caller
(a PE engine) issues `if mmu.overhead_ns > 0: yield env.timeout(mmu.overhead_ns)`
in its own process after translation.

Rationale: the PE engine process already holds its own `record_start` /
`record_end` (op_log) hooks, so keeping timing inside the caller's process
preserves consistent timing accounting. A separate MMU process would split the
engine's processing flow and blur op_log / pipeline overlap semantics.
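
The caller-owns-the-timeout contract can be sketched with plain generators. This is an illustration under stated assumptions, not the engine code: `StubMmu`, `engine_step`, and `run_process` are hypothetical, and each yielded float stands in for one `env.timeout(...)` call.

```python
from __future__ import annotations

from typing import Generator


class StubMmu:
    """Hypothetical stand-in for PeMMU: fixed-offset translation plus overhead."""

    def __init__(self, overhead_ns: float, offset: int = 0x1000_0000) -> None:
        self.overhead_ns = overhead_ns
        self._offset = offset

    def translate(self, va: int) -> int:
        # Plain synchronous call: no events, no simulated time.
        return va + self._offset


def engine_step(mmu: StubMmu, va: int, work_ns: float) -> Generator[float, None, int]:
    """Engine process; each yielded float is a delay in ns."""
    pa = mmu.translate(va)
    if mmu.overhead_ns > 0:
        yield mmu.overhead_ns      # the caller, not the MMU, pays the overhead
    yield work_ns                  # the engine's own work
    return pa


def run_process(proc: Generator[float, None, int]) -> tuple[float, int]:
    """Tiny driver: sum the yielded delays and capture the return value."""
    now = 0.0
    while True:
        try:
            now += next(proc)
        except StopIteration as stop:
            return now, stop.value


with_overhead = run_process(engine_step(StubMmu(2.0), 0x40, 10.0))
without_overhead = run_process(engine_step(StubMmu(0.0), 0x40, 10.0))
```

Because the delay is yielded inside the engine's own generator, it falls between the engine's start and end hooks without any separate MMU process.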
|
||||
|
||||
#### D2.1. Current implementation asymmetry — pipeline vs non-pipeline (known)
|
||||
|
||||
At the time of writing, `pe_dma.py` handles MMU overhead differently in its
|
||||
two call paths:
|
||||
|
||||
- **non-pipeline (`handle_command`)**: after `translate()`, applies
|
||||
`if self._mmu.overhead_ns > 0: yield env.timeout(self._mmu.overhead_ns)`.
|
||||
- **pipeline (`_do_pipeline_dma`)**: calls `translate()` only, **omitting**
|
||||
the overhead timeout — though the comment says "same logic as non-pipeline
|
||||
path", the behaviors differ.
|
||||
|
||||
In the default topology, `tlb_overhead_ns = 0.0`, so this asymmetry does not
manifest. With `tlb_overhead_ns > 0`, however, a GEMM/Math workload on the
pipeline path finishes earlier than the equivalent non-pipeline workload, by
exactly the MMU overhead that the pipeline path skips.
|
||||
|
||||
The D2 contract states that **all** callers pay the overhead; the pipeline
|
||||
omission is **not an intentional design** — ADR-0014 D6 (pipeline self-routing)
|
||||
does not exempt it. Remediation options (require a separate Phase 1/2):
|
||||
|
||||
- (a) Add `if mmu.overhead_ns > 0: yield env.timeout(...)` in
|
||||
`_do_pipeline_dma` to align with D2 — **preferred**.
|
||||
- (b) Narrow the D2 contract to "non-pipeline only" and document the pipeline
|
||||
exemption in an ADR-0014 update — discouraged, since it weakens the
|
||||
overhead's meaning.
|
||||
|
||||
This ADR recommends (a) and assumes a small follow-up change either before or
|
||||
just after acceptance.
|
||||
|
||||
### D3. Page table structure — sub-page region list (stopgap)
|
||||
|
||||
`self._table: dict[vpn, list[(start_in_page, end_in_page, pa_at_offset_zero)]]`
holds multiple disjoint regions per page.

- `map(va, pa, size)`: append one region to each page the range touches,
  splitting the range wherever it crosses a page boundary.
- `translate(va)`: look up regions for the VPN and iterate **in reverse** so
  the most recent overlapping region wins (last-write-wins).
- `unmap(va, size)`: remove only regions whose extent is **fully contained**
  within the unmap range; partial-overlap boundaries are left in place and the
  caller is expected to unmap on the same boundaries used for map.

This is documented as a **simulator stopgap** that supplements the VA model
from ADR-0011. It prevents silent last-write-wins misrouting when DPPolicy
shards below page granularity. Memory note: `project_mmu_subpage_stopgap`.
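
A region-list page table with these three operations might look like the sketch below. It is illustrative only and not the `PeMMU` source: the class name `SubPageTable` is hypothetical, and it assumes `pa_at_offset_zero` means "the PA that page offset 0 would map to", so that translation is `pa_at_offset_zero + offset`.

```python
from __future__ import annotations


class PageFault(Exception):
    """Raised when no region covers the requested VA."""


class SubPageTable:
    def __init__(self, page_size: int = 2 * 1024 * 1024) -> None:
        self.page_size = page_size
        # vpn -> [(start_in_page, end_in_page, pa_at_offset_zero), ...]
        self._table: dict[int, list[tuple[int, int, int]]] = {}

    def map(self, va: int, pa: int, size: int) -> None:
        end = va + size
        while va < end:
            vpn, start = divmod(va, self.page_size)
            chunk = min(end - va, self.page_size - start)  # stop at page boundary
            self._table.setdefault(vpn, []).append((start, start + chunk, pa - start))
            va += chunk
            pa += chunk

    def translate(self, va: int) -> int:
        vpn, off = divmod(va, self.page_size)
        # Newest region first, so overlapping maps are last-write-wins.
        for start, stop, pa0 in reversed(self._table.get(vpn, [])):
            if start <= off < stop:
                return pa0 + off
        raise PageFault(hex(va))

    def unmap(self, va: int, size: int) -> None:
        end = va + size
        for vpn in range(va // self.page_size, (end - 1) // self.page_size + 1):
            base = vpn * self.page_size
            # Keep every region that is not fully inside [va, end).
            kept = [r for r in self._table.get(vpn, [])
                    if not (va <= base + r[0] and base + r[1] <= end)]
            if kept:
                self._table[vpn] = kept
            else:
                self._table.pop(vpn, None)


table = SubPageTable(page_size=4096)
table.map(0x1000, 0xA000, 128)   # shard A: page 1, offsets 0..127
table.map(0x1080, 0xB000, 128)   # shard B: same page, offsets 128..255
```

Both shards stay resolvable in the same page, which a one-PA-per-page table could not express.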
|
||||
|
||||
### D4. PageFault signals PA fallback
|
||||
|
||||
If `translate()` is called with an unmapped VA, `PageFault` is raised. PE_DMA
|
||||
catches the exception and **uses the original address as a PA** (the PA-only
|
||||
backward-compatibility path from ADR-0011). PageFault is therefore not an
|
||||
error — it is the signal for "no VA mapping, interpret as PA".
|
||||
|
||||
This path is intentional and preserves backward compatibility with the
|
||||
ADR-0011 PA-only mode.
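
The fallback can be sketched as a small wrapper around `translate()`. This is a hypothetical illustration, not the PE_DMA code: `resolve_address` and `make_translate` are invented names, and the dict-backed translator only stands in for a real page table.

```python
class PageFault(Exception):
    """Stand-in for the exception translate() raises on an unmapped VA."""


def make_translate(mapping: dict):
    """Build a toy translate() backed by an exact-address dict."""
    def translate(va: int) -> int:
        if va not in mapping:
            raise PageFault(hex(va))
        return mapping[va]
    return translate


def resolve_address(translate, addr: int) -> int:
    """Return the PA for addr, treating PageFault as 'addr is already a PA'."""
    try:
        return translate(addr)
    except PageFault:
        return addr


translate = make_translate({0x1000: 0xA000})
mapped = resolve_address(translate, 0x1000)
fallback = resolve_address(translate, 0x2000)
```

The `except` branch is the normal PA-only path, so it should not log or count an error.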
|
||||
|
||||
### D5. MMU sideband-message reception contract
|
||||
|
||||
`MmuMapMsg` / `MmuUnmapMsg` arrive over the fabric at PE_MMU's `_inbox`
|
||||
(SPEC R10: "MMU map installation incurs measured fabric latency"). Schemas
|
||||
live in `runtime_api/kernel.py`:
|
||||
|
||||
- `MmuMapMsg.entries: tuple[dict, ...]` — each dict is
|
||||
`{"va": int, "pa": int, "size": int}`.
|
||||
- `MmuUnmapMsg.entries: tuple[dict, ...]` — each dict is
|
||||
`{"va": int, "size": int}`.
|
||||
|
||||
PE_MMU reception flow:
|
||||
|
||||
1. `_worker` does `_inbox.get()` for one message.
|
||||
2. `hasattr(msg, "request")` confirms a Transaction wrapper.
|
||||
3. `isinstance(msg.request, MmuMapMsg)` → for each entry, call
|
||||
`self._mmu.map(va=e["va"], pa=e["pa"], size=e["size"])`.
|
||||
4. `isinstance(msg.request, MmuUnmapMsg)` → for each entry, call
|
||||
`self._mmu.unmap(va=e["va"], size=e["size"])`.
|
||||
5. Both signal `msg.done.succeed()` after completion.
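
Steps 2 to 4 of the reception flow can be sketched as a plain dispatch function. The dataclasses below only mirror the `entries` schemas listed above; `Txn`, `RecordingMmu`, and `handle_sideband` are hypothetical stand-ins, and the `done` signalling of step 5 is left to the caller.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MmuMapMsg:
    entries: tuple[dict, ...]      # each: {"va": int, "pa": int, "size": int}


@dataclass(frozen=True)
class MmuUnmapMsg:
    entries: tuple[dict, ...]      # each: {"va": int, "size": int}


@dataclass
class Txn:
    request: object                # hypothetical Transaction wrapper


class RecordingMmu:
    """Records calls instead of maintaining a page table."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    def map(self, va: int, pa: int, size: int) -> None:
        self.calls.append(("map", va, pa, size))

    def unmap(self, va: int, size: int) -> None:
        self.calls.append(("unmap", va, size))


def handle_sideband(mmu: RecordingMmu, msg: object) -> bool:
    """Apply map/unmap entries; return False if msg is not MMU sideband."""
    request = getattr(msg, "request", None)
    if isinstance(request, MmuMapMsg):
        for e in request.entries:
            mmu.map(va=e["va"], pa=e["pa"], size=e["size"])
        return True
    if isinstance(request, MmuUnmapMsg):
        for e in request.entries:
            mmu.unmap(va=e["va"], size=e["size"])
        return True
    return False                   # caller falls back to generic forwarding


mmu = RecordingMmu()
handled_map = handle_sideband(mmu, Txn(MmuMapMsg(({"va": 0x1000, "pa": 0xA000, "size": 128},))))
handled_unmap = handle_sideband(mmu, Txn(MmuUnmapMsg(({"va": 0x1000, "size": 128},))))
handled_other = handle_sideband(mmu, Txn(request="something else"))
```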
|
||||
|
||||
An external caller (runtime API) waiting on `done` therefore has a SimPy-level
guarantee that the mapping is installed on-device by the time `done` fires.
This realizes ADR-0011's "MMU map installation incurs measured fabric latency".
|
||||
|
||||
This ADR does **not** define the **sender or fan-out policy** for the sideband
|
||||
message — those are runtime API responsibilities. Only the receive contract
|
||||
belongs here.
|
||||
|
||||
### D6. Non-MMU Transactions delegate to generic forwarding
|
||||
|
||||
If a message pulled from `_inbox` is not `MmuMapMsg` / `MmuUnmapMsg` (or
|
||||
lacks a `request` attribute), `_forward_txn` handles it normally. This keeps
|
||||
the door open for future topologies where PE_MMU sits on a pass-through path —
|
||||
current code never sends such traffic, but the routing remains safe.
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Make `translate()` a SimPy generator
|
||||
|
||||
Rejected. As D2 explains, this blurs op_log / pipeline overlap accounting in
|
||||
the PE engine.
|
||||
|
||||
### A2. Use small page size (e.g., 128 B) instead of sub-page regions
|
||||
|
||||
Rejected. Would explode page-table memory and cube-wide map message size. Most
|
||||
mappings are 2 MiB; pushing the page size below that for the few DPPolicy
|
||||
sharding cases inflates average cost.
|
||||
|
||||
### A3. Make PE_MMU a PE_CPU helper only (not a topology node)
|
||||
|
||||
Rejected. ADR-0011 requires that MMU map installation incur measured fabric
|
||||
latency (via `MmuMapMsg`), which requires PE_MMU to be a node on the graph.
|
||||
It also keeps cube NoC visualizer output consistent.
|
||||
|
||||
## Consequences
|
||||
|
||||
- PE_MMU's dual role is justified at ADR level, so future "unify into one"
|
||||
refactor pressure has a documented counterpoint.
|
||||
- The sub-page region model is explicitly labeled a stopgap, providing a
|
||||
basis for deprecating it when LA model (ADR-0011) lands.
|
||||
- The "`translate()` does not yield" contract is locked in (D2), so any
|
||||
future proposal to add an internal MMU timeout can be denied with a
|
||||
documented rationale.
|
||||
- PA fallback (D4) is normalized, preventing defensive logic from treating
|
||||
PageFault as an error.
|
||||
@@ -0,0 +1,149 @@
|
||||
# ADR-0040: PE_TCM Component Model — Dual-Channel BW Serialization
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-20).
|
||||
|
||||
ADR-0014 (PE Pipeline Execution Model, D1) references PE_TCM as a "BW-based
|
||||
serialized scratchpad memory" but does not pin down the component's own model.
|
||||
This ADR fills that gap.
|
||||
|
||||
## First action
|
||||
|
||||
When `start()` is invoked, immediately create two `simpy.Resource(env, capacity=1)`
|
||||
instances and store them in `self._read_res` / `self._write_res`. These two
|
||||
resources are the single decision points that serialize the **read channel**
|
||||
and **write channel** to one in-flight request each.
|
||||
|
||||
The runtime first action: `_worker` pulls a message off `_inbox` and branches
by type:

- `TcmRequest` (from `pe_fetch_store`): spawn `env.process(self._handle_tcm_request)`.
  Hence **TCM's first act is "acquire the lock matching the direction
  (read/write)"**. After lock acquisition, if `bw > 0 and nbytes > 0`, yield
  `env.timeout(nbytes / bw)` (that is, `delay_ns = nbytes / bw`), then call
  `req.done.succeed()`.
- Anything else (Transaction): spawn `env.process(self._forward_txn)` (legacy
  fabric pass-through).
|
||||
|
||||
At construction, `node.attrs["read_bw_gbs"]` and `node.attrs["write_bw_gbs"]`
|
||||
(default `512.0 GB/s` each) are captured and held.
|
||||
|
||||
## Context
|
||||
|
||||
In the PE pipeline (ADR-0014 D1, D6), PE_TCM receives two kinds of traffic:
|
||||
|
||||
1. **`TcmRequest` from PE_FETCH_STORE** — when moving data between TCM and
|
||||
the register file, PE_FETCH_STORE sends a short sideband request to obtain
|
||||
BW-serialized access latency (`direction = "read"` or `"write"`, `nbytes`,
|
||||
`done` event).
|
||||
2. **Legacy Transaction forwarding** — a fallback in case TCM ends up as a
|
||||
pass-through node on the fabric graph (not used by the current critical
|
||||
path, but preserved).
|
||||
|
||||
The problem: ADR-0014 only says "BW-based serialization" without specifying:
|
||||
|
||||
- Read and write are **independent channels** running in parallel; only
|
||||
same-direction concurrency serializes at `capacity=1`.
|
||||
- BW is split into two configurable values (`read_bw_gbs` / `write_bw_gbs`).
|
||||
- The formula is `delay_ns = nbytes / bw_gbs` (loose unit convention:
|
||||
GB/s × ns ≈ B).
|
||||
- `nbytes == 0` still acquires the lock but skips the BW term.
|
||||
- `run()`'s `overhead_ns` (default `0.0`) is only used in the legacy fabric
|
||||
forwarding path.
|
||||
|
||||
Each of these requires an ADR. In particular, "why are read and write
|
||||
separate channels" and "who owns the BW values" must be documented so that
|
||||
future changes (e.g., `capacity=2`) have a clear basis.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. Dual channel — read and write are independent resources
|
||||
|
||||
`_read_res = simpy.Resource(env, capacity=1)`,
`_write_res = simpy.Resource(env, capacity=1)`.
Same-direction concurrent requests queue on the resource and serialize;
opposite-direction requests proceed in parallel. This matches the hardware
model where TCM has a dual-port (read + write) configuration, and it allows
the simulator to express the GEMM-pipeline case where fetch (read) and store
(write) overlap in time: BW-serialized inside each direction but
independent across directions.
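
The effect of two independent `capacity=1` channels can be worked out by hand with a small scheduler. This is a sketch, not the `pe_tcm.py` code: `schedule` is a hypothetical helper, requests are assumed to be served in arrival order within each direction, and the delay term is the `nbytes / bw_gbs` model.

```python
from __future__ import annotations


def schedule(requests: list[tuple[float, str, int]],
             read_bw_gbs: float = 512.0,
             write_bw_gbs: float = 512.0) -> list[tuple[float, float]]:
    """Map (arrival_ns, direction, nbytes) requests to (start_ns, end_ns)."""
    bw = {"read": read_bw_gbs, "write": write_bw_gbs}
    free_at = {"read": 0.0, "write": 0.0}   # when each channel's lock frees up
    spans = []
    for arrival, direction, nbytes in requests:
        start = max(arrival, free_at[direction])   # wait for the same-direction lock
        rate = bw[direction]
        delay = nbytes / rate if nbytes > 0 and rate > 0 else 0.0
        end = start + delay
        free_at[direction] = end
        spans.append((start, end))
    return spans


spans = schedule([
    (0.0, "read", 4096),    # runs 0..8 ns
    (0.0, "read", 4096),    # queues behind the first read: 8..16 ns
    (0.0, "write", 4096),   # independent channel: 0..8 ns
])
```

The second read waits for the first, while the write overlaps both, which is the fetch/store overlap described above.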
|
||||
|
||||
### D2. Per-channel BW model — `nbytes / bw_gbs`
|
||||
|
||||
After lock acquisition, if `nbytes > 0 and bw > 0`, yield
`env.timeout(nbytes / bw_gbs)`. The unit convention is GB/s × ns ≈ B
(1 GB/s moves one byte per nanosecond when 1 GB = 10^9 B), consistent with
the simulator-wide loose convention (see ADR-0033).

- `nbytes == 0`: BW term is zero, but the lock is acquired and released. This
  is intentional: when a plan generator emits an empty fetch/store on the
  PE_FETCH_STORE side, the op_log / channel accounting on the TCM side still
  records one consumption.
- `bw == 0` (config error): the timeout call is skipped (0-time pass). This
  should not occur with normal settings.
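
The unit shortcut and the two edge cases can be checked with a few lines. This is a sketch with hypothetical function names; `tcm_delay_ns_si` exists only to show that the shortcut agrees with the explicit SI conversion when 1 GB is taken as 10^9 bytes.

```python
import math


def tcm_delay_ns(nbytes: int, bw_gbs: float) -> float:
    """Per-request BW delay in ns; zero when nbytes or bw is not positive."""
    if nbytes > 0 and bw_gbs > 0:
        return nbytes / bw_gbs
    return 0.0


def tcm_delay_ns_si(nbytes: int, bw_gbs: float) -> float:
    """Same quantity via explicit units: bytes / (bytes per second), then s to ns."""
    seconds = nbytes / (bw_gbs * 1e9)
    return seconds * 1e9


shortcut = tcm_delay_ns(4096, 512.0)       # 4096 B at 512 GB/s
explicit = tcm_delay_ns_si(4096, 512.0)
```

At the default 512 GB/s, a 4096-byte request therefore holds its channel for 8 ns.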
|
||||
|
||||
### D3. BW values come from `node.attrs.read_bw_gbs` / `write_bw_gbs`
|
||||
|
||||
Defaults `512.0 GB/s`. The topology builder (`topology/builder.py`) passes
|
||||
these attrs when instantiating TCM from `pe_template`. Default changes should
|
||||
coincide with related decisions in ADR-0014 D1 or ADR-0033.
|
||||
|
||||
### D4. TcmRequest schema is owned by PE_TCM
|
||||
|
||||
`@dataclass TcmRequest(direction: str, nbytes: int, done: simpy.Event, tag: str = "")`
|
||||
lives in `components/builtin/pe_tcm.py`. PE_FETCH_STORE imports the dataclass
|
||||
and only constructs/sends it. The caller does not define the schema because:
|
||||
|
||||
- The meaning of BW serialization is TCM's responsibility — TCM decides which
|
||||
fields drive serialization.
|
||||
- The valid-value check for `direction` (must be `"read"` or `"write"`) lives
|
||||
in `_handle_tcm_request`'s if/else branch.
|
||||
|
||||
### D5. Legacy Transaction forwarding path is preserved
|
||||
|
||||
When `_worker` receives a non-`TcmRequest` message, it dispatches to
|
||||
`_forward_txn`, applying `run()`'s `overhead_ns`. The current standard PE
|
||||
pipeline does not route Transactions through TCM, but the path is kept to
|
||||
avoid breakage if fabric topology changes.
|
||||
|
||||
This path is accounted for via standard Transaction op_log; the BW channel
|
||||
locks are **not** acquired (orthogonal to D1's usage).
|
||||
|
||||
### D6. PE_TCM is not a data store (timing only)
|
||||
|
||||
TCM models **time only**. The actual data payload is held by sim_engine's
|
||||
`memory_store` (when present); the TCM component never updates it.
|
||||
PE_FETCH_STORE obtains BW delay through `TcmRequest`, and register contents
|
||||
are handled separately in the data path (ADR-0020 2-pass data execution —
|
||||
Phase 2).
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Single channel (`capacity=2` for shared read+write)
|
||||
|
||||
Rejected. Would artificially serialize the normal-case overlap of fetch
|
||||
(read) and store (write) and yield an incorrect BW upper bound for the PE
|
||||
pipeline.
|
||||
|
||||
### A2. `capacity > 1` (e.g., 2-banked TCM)
|
||||
|
||||
Rejected. Current hardware model assumes a single bank. Multi-bank extension
|
||||
needs its own ADR that would supersede D1. Bumping capacity now would loosen
|
||||
the nominal serialization without raising the BW upper bound, producing less
|
||||
accurate modeling.
|
||||
|
||||
### A3. Generalize BW formula to `nbytes / bw + overhead_ns`
|
||||
|
||||
Rejected. `overhead_ns` is reserved for the legacy forwarding path (D5).
|
||||
Additional fetch/store-path overhead, if needed, belongs in PE_FETCH_STORE's
|
||||
`run()` or in a register-file access model — closer to the responsibility
|
||||
boundary.
|
||||
|
||||
## Consequences
|
||||
|
||||
- TCM's BW accounting is locked at ADR level. Questions arising from op_log
|
||||
in GEMM/Math sweeps — "why did fetch and store overlap?", "why do only
|
||||
same-direction requests serialize?" — resolve quickly to D1.
|
||||
- Future multi-bank TCM models or asymmetric read/write BW changes have a
|
||||
clear blast radius (D1 / D2 / D3 — pick one).
|
||||
- D6 ("TCM is not a data store") sharpens the responsibility boundary with
|
||||
ADR-0020 2-pass execution.
|
||||
@@ -0,0 +1,195 @@
|
||||
# ADR-0041: Cube SRAM Component Model — terminal scratchpad on cube NoC
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-20).
|
||||
|
||||
ADR-0017 (Cube NOC and HBM Connectivity) describes SRAM as a cube-NoC
|
||||
attachment but does not specify the SRAM component's own latency / response
|
||||
model. This ADR fills that gap.
|
||||
|
||||
## First action
|
||||
|
||||
Inside `_worker`, immediately after pulling a Transaction off `_inbox`, the
|
||||
very first action is `yield from self.run(env, txn.nbytes)`. Inside `run()`,
|
||||
the component applies `env.timeout(node.attrs["overhead_ns"])`
|
||||
(default `0.0`).
|
||||
|
||||
In short, **SRAM's first act is "express access overhead as simulator time"**.
|
||||
After overhead, the worker yields `drain_ns` (the terminal BW-serialization
|
||||
cost stamped on the Transaction) and then constructs and dispatches a
|
||||
`ResponseMsg` on the reverse path.
|
||||
|
||||
This differs from a generic `ComponentBase._worker`: SRAM knows it is a
|
||||
**terminal node**, so it does not go through `_forward_txn`. Its own worker
|
||||
explicitly performs `run → drain → _send_response`.
|
||||
|
||||
## Context
|
||||
|
||||
The cube topology (`topology/builder.py`) creates the following named nodes
|
||||
per cube:
|
||||
|
||||
- `sip{S}.cube{C}.m_cpu`
|
||||
- `sip{S}.cube{C}.sram`
|
||||
- `sip{S}.cube{C}.hbm_ctrl` (per-PE partitions)
|
||||
- `sip{S}.cube{C}.pe{P}` (and its PE-internal sub-components)
|
||||
|
||||
SRAM is one of the cube-NoC attachments — `topology/mesh_gen.py` assigns it
|
||||
to the nearest router by placement coordinates and adds `"sram"` to that
|
||||
router's `attach` list. The builder lays bidirectional `sram ↔ router` edges
|
||||
(BW: `sram_to_router_bw_gbs`, default `128.0 GB/s`).
|
||||
|
||||
SRAM has two intertwined roles:
|
||||
|
||||
1. **Fabric terminal**: the endpoint for cube-NoC memory-access Transactions
|
||||
destined for SRAM. SRAM consumes access overhead + drain, then sends a
|
||||
response back on the reverse path.
|
||||
2. **One of the IPCQ slot tiers**: ADR-0023 D9.7 defines
|
||||
`buffer_kind ∈ {tcm, sram, hbm}`; the `sram` tier's per-access cost is
|
||||
`(512.0 GB/s, 2.0 ns)` in `common/ipcq_types._BUFFER_KIND_BW`. This is
|
||||
separate from the SRAM node's `overhead_ns` attr; PE_DMA accounts for it
|
||||
directly at the IPCQ slot-write moment.
|
||||
|
||||
Without an ADR covering both roles, the following questions are ambiguous:

- "What latency does SRAM model?" Fabric drain + overhead, or the IPCQ
  tier slot latency? The answers are scattered across the code.
- What does the `size_mb` attr (currently `32`) mean? It is currently
  unused; SRAM only models timing.
- Which cube router does SRAM attach to? (placement-based; lives in topology
  code only.)
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. SRAM is a terminal scratchpad node on the cube NoC
|
||||
|
||||
`SramComponent` extends `ComponentBase` but overrides `_worker` to express
|
||||
terminal semantics directly:
|
||||
|
||||
```
while True:
    txn = yield self._inbox.get()
    yield from self.run(env, txn.nbytes)  # overhead_ns
    if drain_ns > 0:                      # drain_ns: stamped on the txn (see D3)
        yield env.timeout(drain_ns)
    yield from self._send_response(env, txn)
```
|
||||
|
||||
This pattern is necessary because SRAM must know the reverse path; the
|
||||
generic `_forward_txn` (which forwards to the next hop) does not fit a
|
||||
terminal.
|
||||
|
||||
#### D1.1. Currently dormant — the `_worker` override is an unused path
|
||||
|
||||
At the time of writing, **no component actually sends a Transaction to the
|
||||
SRAM node**. The verified references to the SRAM node ID are:
|
||||
|
||||
- `policy/routing/router.py` and friends — guarantee path lookups.
|
||||
- `components/builtin/pe_dma.py::_handle_ipcq_inbound` — for
|
||||
`buffer_kind == "sram"`, computes the *path* to
|
||||
`bank_node = f"{cube_prefix}.sram"` via `compute_drain_ns(path, ...)` and
|
||||
yields a **local** timeout. The Transaction itself does not flow to the
|
||||
SRAM node (see D4).
|
||||
- `tests/test_routing.py` — checks connectivity via
|
||||
`find_path("sip0.cube0.pe0", "sip0.cube0.sram")`.
|
||||
|
||||
So the `_worker` / `_send_response` override is currently a **dormant code
|
||||
path**. It is preserved deliberately:
|
||||
|
||||
- Topology changes that route fabric Transactions to SRAM terminally (e.g.,
|
||||
explicit M_CPU → SRAM accesses) would activate it immediately.
|
||||
- ADR-0017's "cube-attached scratchpad" semantics naturally implies terminal
|
||||
behavior; the override is an intentional placeholder.
|
||||
|
||||
A future ADR (or a revision to this one) will mark dormancy resolved when an
|
||||
actual sender is added.
|
||||
|
||||
### D2. ResponseMsg construction and reverse-path dispatch
|
||||
|
||||
`_send_response`:
|
||||
|
||||
1. `reverse_path = list(reversed(txn.path))` — derive the reverse path.
|
||||
2. Construct `ResponseMsg(correlation_id=txn.request.correlation_id,
|
||||
request_id=..., src_cube=<this cube>, src_pe=-1, success=True)`.
|
||||
3. Wrap in `Transaction(request=resp_msg, path=reverse_path, step=0,
|
||||
nbytes=0, done=env.event(), is_response=True)` and put on
|
||||
`out_ports[reverse_path[1]]`.
|
||||
4. If the reverse path is too short (`< 2 hops`) or `ctx` is absent, fall
|
||||
back to calling the original `txn.done.succeed()`.
|
||||
|
||||
`src_pe = -1` means "SRAM is not PE-localized". `src_cube` is parsed from the
|
||||
node ID (`sip{S}.cube{C}.sram`).
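
The four steps above can be sketched as follows. This is a minimal sketch, not the project's code: `FakeEvent`, `Request`, `ResponseMsg`, and `Transaction` are simplified stand-ins, `out_ports` is modeled as a dict of plain lists, the elided `request_id` field is left out, the `ctx` check is omitted, and "too short" is assumed to mean fewer than two nodes on the reverse path.

```python
import re
from dataclasses import dataclass
from typing import Any, Optional


class FakeEvent:
    """Stand-in for a SimPy event; only records whether succeed() ran."""

    def __init__(self) -> None:
        self.triggered = False

    def succeed(self) -> None:
        self.triggered = True


@dataclass
class Request:
    correlation_id: int


@dataclass
class ResponseMsg:
    # request_id is elided in the ADR text, so it is omitted here.
    correlation_id: int
    src_cube: int
    src_pe: int
    success: bool


@dataclass
class Transaction:
    request: Any
    path: list
    step: int
    nbytes: int
    done: FakeEvent
    is_response: bool = False


def parse_cube(node_id: str) -> int:
    """Extract C from a node ID shaped like sip{S}.cube{C}.sram."""
    match = re.fullmatch(r"sip(\d+)\.cube(\d+)\.sram", node_id)
    if match is None:
        raise ValueError(f"not an SRAM node ID: {node_id!r}")
    return int(match.group(2))


def send_response(node_id: str, txn: Transaction,
                  out_ports: dict) -> Optional[Transaction]:
    # Step 1: derive the reverse path.
    reverse_path = list(reversed(txn.path))
    # Step 4 (assumed reading): no next hop, so complete the request locally.
    if len(reverse_path) < 2:
        txn.done.succeed()
        return None
    # Step 2: src_pe = -1 marks "not PE-localized".
    resp = ResponseMsg(correlation_id=txn.request.correlation_id,
                       src_cube=parse_cube(node_id), src_pe=-1, success=True)
    # Step 3: wrap and hand to the port of the first reverse hop.
    resp_txn = Transaction(request=resp, path=reverse_path, step=0,
                           nbytes=0, done=FakeEvent(), is_response=True)
    out_ports[reverse_path[1]].append(resp_txn)
    return resp_txn
```

In the real component the port is presumably a SimPy store rather than a list; the sketch only shows which hop receives the response and which fields are fixed.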
|
||||
|
||||
### D3. Timing parameters: `overhead_ns` and wire-side `drain_ns`
|
||||
|
||||
- **Component-side latency**: `node.attrs["overhead_ns"]`. Default topology
|
||||
uses `2.0 ns`.
|
||||
- **Link-side serialization**: `drain_ns` arrives stamped on the Transaction
|
||||
— the wire-side BW serialization result from ADR-0015. SRAM only yields it.
|
||||
- The `size_mb` attr (default `32 MiB`) is currently timing-neutral. If a
|
||||
capacity-aware model is added in the future, a separate ADR will give it
|
||||
meaning.
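
As a rough illustration of D3, the component-side service time can be written as a plain function. This is a sketch under assumptions: `sram_service_ns` is a hypothetical helper name, and the `2.0` fallback mirrors the default topology value rather than a confirmed code default.

```python
def sram_service_ns(attrs: dict, drain_ns: float) -> float:
    """Time the SRAM node spends on one Transaction (hypothetical helper).

    attrs    -- node attributes, e.g. {"overhead_ns": 2.0, "size_mb": 32}
    drain_ns -- wire-side serialization time stamped on the Transaction
    """
    # Assumed fallback: the default topology uses 2.0 ns.
    overhead_ns = float(attrs.get("overhead_ns", 2.0))
    # size_mb is deliberately ignored: it is timing-neutral today.
    return overhead_ns + drain_ns
```

Changing `size_mb` leaves the result unchanged, which is the property D3 records.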
|
||||
|
||||
### D4. IPCQ slot accounting is not modeled by the SRAM component
|
||||
|
||||
Per ADR-0023 D9.7, the IPCQ slot-write latency for the SRAM tier is incurred
|
||||
inside PE_DMA's `_handle_ipcq_inbound`, which calls
|
||||
`slot_io_latency_ns("sram", nbytes)` using `_BUFFER_KIND_BW["sram"]`. That is:
|
||||
|
||||
- When SRAM receives a fabric Transaction (D1, D2, D3 apply), it processes
|
||||
normally.
|
||||
- When an IPCQ slot lives on SRAM, PE_DMA pays the slot-write time directly —
|
||||
independent of the SRAM component.
|
||||
|
||||
This separation is intentional: IPCQ is a fast path (sub-cycle slot
|
||||
bookkeeping) and does not traverse fabric Transactions, so SRAM does not need
|
||||
to know about IPCQ.
|
||||
|
||||
### D5. SRAM's cube-NoC attachment is placement-driven
|
||||
|
||||
`topology/mesh_gen.py` reads `placement.sram.pos_mm` (default `[1.5, 9.0]` in
|
||||
`topology.yaml`) and adds `"sram"` to the nearest router's `attach`. The
|
||||
builder (`topology/builder.py`'s attachment loop) then lays bidirectional
|
||||
edges between the `sram` node and that router.
|
||||
|
||||
This decision lives outside the SRAM component (mesh_gen / builder); the
|
||||
component does not know which router it sits on. It relies only on
|
||||
`txn.path` / `reverse_path`, which carry traffic to and from it through that router.
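
The placement rule in D5 can be sketched as a nearest-neighbor lookup. This is a minimal sketch under assumptions: the router names and coordinates are invented for illustration, Euclidean distance is assumed as the "nearest" metric, and `attach_sram` is a hypothetical function, not the `mesh_gen` or `builder` API.

```python
import math


def nearest_router(pos_mm, router_pos):
    """Return the router name closest to pos_mm (Euclidean, assumed metric)."""
    return min(router_pos, key=lambda name: math.dist(pos_mm, router_pos[name]))


def attach_sram(router_pos, attach, pos_mm=(1.5, 9.0)):
    """Add "sram" to the nearest router's attach list and return both edges."""
    router = nearest_router(pos_mm, router_pos)
    attach.setdefault(router, []).append("sram")
    # Bidirectional edges between the sram node and that router.
    return [("sram", router), (router, "sram")]


# Hypothetical router coordinates in millimetres.
routers = {"r0": (0.0, 0.0), "r1": (2.0, 8.0), "r2": (6.0, 8.0)}
attach_lists = {}
edges = attach_sram(routers, attach_lists)
```

Because the component never sees `pos_mm`, moving the SRAM coordinate only changes which router ends up in `attach_lists`.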
|
||||
|
||||
### D6. SRAM is not a data store (timing only)
|
||||
|
||||
Same context as ADR-0040 D6: the SRAM component models time only; the data
|
||||
payload (if any) lives in sim_engine's `memory_store`.
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Use `_forward_txn` and route responses via separate nodes (à la IO_CPU / HBM_CTRL)
|
||||
|
||||
Rejected. SRAM is a terminal on the cube NoC; adding a response node would
|
||||
introduce meaningless hops and violate ADR-0017's simplification spirit.
|
||||
|
||||
### A2. Model BW serialization inside SRAM with its own resource
|
||||
|
||||
Rejected. Wire-side BW serialization (`drain_ns`) already captures it. An
|
||||
internal `simpy.Resource` would double-count against ADR-0015 (port/wire
|
||||
model).
|
||||
|
||||
### A3. Handle IPCQ slot accounting in the SRAM component
|
||||
|
||||
Rejected. As D4 makes explicit, IPCQ is a fast path that does not traverse
|
||||
fabric Transactions. If SRAM knew about IPCQ, the responsibility would split
|
||||
across two places and obscure reasoning.
|
||||
|
||||
### A4. Capacity-aware latency from `size_mb`
|
||||
|
||||
Rejected for now. The capacity is currently a visualizer label; introducing
|
||||
a capacity-aware timing model requires a dedicated ADR.
|
||||
|
||||
## Consequences
|
||||
|
||||
- SRAM's timing model is pinned at ADR level as
|
||||
`overhead_ns + drain_ns + ResponseMsg(reverse_path)`. Any proposal to push
|
||||
IPCQ slot latency into the SRAM component can be refused with D4.
|
||||
- D3 records that `size_mb` is timing-neutral today, so a future
|
||||
capacity-aware model has a narrow compatibility scope.
|
||||
- D5 documents the placement-driven attachment, so changes to the SRAM
|
||||
coordinate have a clearly bounded impact (`mesh_gen` only).
|
||||
@@ -0,0 +1,199 @@
|
||||
# ADR-0042: Tile Plan Generators — GEMM/Math Pipeline Plan Builders
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-20).
|
||||
|
||||
This ADR pins down `tiling.py` as a **plan-generator
|
||||
module**, not a SimPy component.
|
||||
|
||||
ADR-0014 (PE Pipeline Execution Model) D6 (tile plan / self-routing) does not
|
||||
specify the tile-plan generation algorithm itself; this ADR fills that gap.
|
||||
|
||||
## First action
|
||||
|
||||
When `generate_gemm_plan(M, K, N, tile_m, tile_k, tile_n, ..., pe_prefix,
|
||||
a_pinned, b_pinned, epilogue_specs)` is called, the very first action is
|
||||
**computing tile counts and constructing the PE-component ID strings**:
|
||||
|
||||
```
|
||||
M_tiles = max(1, ceil(M / tile_m))
|
||||
K_tiles = max(1, ceil(K / tile_k))
|
||||
N_tiles = max(1, ceil(N / tile_n))
|
||||
dma_id = f"{pe_prefix}.pe_dma"
|
||||
fetch_id = f"{pe_prefix}.pe_fetch_store"
|
||||
gemm_id = f"{pe_prefix}.pe_gemm"
|
||||
math_id = f"{pe_prefix}.pe_math"
|
||||
```
|
||||
|
||||
In short, **the plan generator's first act is "compute ceiling tile counts
|
||||
and assemble the four sub-component IDs for this PE once"**. No SimPy event
|
||||
or environment is touched — this module is a pure function.
|
||||
|
||||
`generate_math_plan(M, N, tile_m, tile_n, ..., math_op, src_addr, dst_addr,
|
||||
pe_prefix)` likewise begins by computing `M_tiles`, `N_tiles` and assembling
|
||||
three component IDs (`dma_id`, `fetch_id`, `math_id`).
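
A runnable rendering of the GEMM first action, for reference. This is a sketch: `first_action` is a hypothetical wrapper name, and only the tile-count and ID arithmetic shown in the block above is reproduced.

```python
from math import ceil


def first_action(M, K, N, tile_m, tile_k, tile_n, pe_prefix):
    """Compute ceiling tile counts and the four PE sub-component IDs."""
    counts = {
        "M_tiles": max(1, ceil(M / tile_m)),
        "K_tiles": max(1, ceil(K / tile_k)),
        "N_tiles": max(1, ceil(N / tile_n)),
    }
    ids = {
        "dma_id": f"{pe_prefix}.pe_dma",
        "fetch_id": f"{pe_prefix}.pe_fetch_store",
        "gemm_id": f"{pe_prefix}.pe_gemm",
        "math_id": f"{pe_prefix}.pe_math",
    }
    return counts, ids


counts, ids = first_action(100, 64, 1, 32, 32, 16, "sip0.cube0.pe0")
```

The `max(1, ...)` clamp means a zero-sized dimension still yields one tile, so the loops that follow always run at least once.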
|
||||
|
||||
## Context
|
||||
|
||||
ADR-0014 D6 agreed that "PE_SCHEDULER, on receiving a CompositeCmd, generates
|
||||
a TilePlan and feeds self-routing tile tokens". But the **concrete plan
|
||||
generation algorithm** lives in `src/kernbench/components/builtin/tiling.py`,
|
||||
which:
|
||||
|
||||
- Defines no component — it is a pair of **pure functions**
|
||||
(`generate_gemm_plan`, `generate_math_plan`).
|
||||
- Does not depend on the SimPy environment, queues, op_log, or hooks.
|
||||
- Returns a `PipelinePlan` (dataclass).
|
||||
|
||||
The original G4 analysis incorrectly described `tiling.py` as a component;
|
||||
it is in fact a plan-builder helper consumed by PE_SCHEDULER. Pinning this
|
||||
down in its own ADR (paired with ADR-0014 D6) prevents:
|
||||
|
||||
- Ambiguity over whether plan generation belongs to PE_SCHEDULER or a
|
||||
separate module.
|
||||
- Inconsistent rationale for stage sequences (e.g., FETCH/STORE position)
|
||||
between GEMM and Math plans.
|
||||
- Undocumented branching rationale for `a_pinned` / `b_pinned` /
|
||||
`epilogue_specs`.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. `tiling` is a pure plan-generator module, not a component
|
||||
|
||||
`components/builtin/tiling.py` defines no `ComponentBase` subclass. It exports
|
||||
two module-level functions:
|
||||
|
||||
- `generate_gemm_plan(...) -> PipelinePlan`
|
||||
- `generate_math_plan(...) -> PipelinePlan`
|
||||
|
||||
There is no `tiling` node in the topology graph. It lives in `builtin/`
|
||||
because it is a direct helper for PE_SCHEDULER (ADR-0014 D6) and is
|
||||
conceptually a PE_SCHEDULER internal utility.
|
||||
|
||||
### D2. GEMM plan stage sequence — `M → N → K` order
|
||||
|
||||
For each `(m, n, k)` tile (default — no operand pinning, no epilogue):
|
||||
|
||||
```
|
||||
[DMA_READ(A)] → [DMA_READ(B)] → FETCH → GEMM
|
||||
↑
|
||||
↓
|
||||
(last k tile only) [MATH(output_tile)]* → STORE → DMA_WRITE
|
||||
```
|
||||
|
||||
`k_tile` epilogue inserts a MATH stage immediately after GEMM on every
|
||||
K-tile; `output_tile` epilogue inserts MATH stages once per `(m, n)` after
|
||||
the final K-tile but before STORE/DMA_WRITE. The K-loop accumulator stays
|
||||
in the register file across K-tiles — STORE/DMA_WRITE happens only when
|
||||
`last_k`.
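
The default sequence (no pinning, no epilogue) can be sketched as a nested loop. This is a sketch under assumptions: `gemm_stage_names` is a hypothetical function, stages are reduced to name strings, and "`M → N → K` order" is read as M outermost and K innermost.

```python
def gemm_stage_names(M_tiles, N_tiles, K_tiles):
    """List ((m, n, k), stage_names) in assumed M -> N -> K order."""
    plan = []
    for m in range(M_tiles):
        for n in range(N_tiles):
            for k in range(K_tiles):
                stages = ["DMA_READ(A)", "DMA_READ(B)", "FETCH", "GEMM"]
                # The accumulator stays resident across K-tiles, so the
                # write-back stages appear only on the last K-tile.
                if k == K_tiles - 1:
                    stages += ["STORE", "DMA_WRITE"]
                plan.append(((m, n, k), tuple(stages)))
    return plan


plan = gemm_stage_names(M_tiles=1, N_tiles=2, K_tiles=2)
```

With `K_tiles=2`, each `(m, n)` pair produces one four-stage tile followed by one six-stage tile.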
|
||||
|
||||
### D3. Operand pinning — `a_pinned` / `b_pinned`
|
||||
|
||||
If a caller passes `a_pinned=True`, **the A DMA_READ is omitted from every
|
||||
(m, n, k) tile**. Semantically: the caller (e.g., `tl.composite`) has already
|
||||
staged all of A in TCM via a prior `tl.load`, and signals so to the plan
|
||||
generator.
|
||||
|
||||
The branch is made at plan time (not at runtime). Therefore the stage record
|
||||
count in op_log changes deterministically with pinning, and sweep analyses
|
||||
(e.g., gemm_sweep's stage record count) see this decision directly.
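
To make the "changes deterministically" claim concrete, the stage count of a plan without epilogue can be computed in closed form. This is a sketch under assumptions: `gemm_stage_count` is a hypothetical helper, `b_pinned` is assumed to drop the B read symmetrically to `a_pinned`, and one op_log stage record per plan stage is assumed.

```python
def gemm_stage_count(M_tiles, N_tiles, K_tiles, a_pinned=False, b_pinned=False):
    """Total stages in a GEMM plan without epilogue (hypothetical helper)."""
    # Each unpinned operand contributes one DMA_READ per (m, n, k) tile.
    reads_per_k = (0 if a_pinned else 1) + (0 if b_pinned else 1)
    per_k = reads_per_k + 2   # plus FETCH and GEMM
    per_output = 2            # STORE and DMA_WRITE on the last K-tile only
    return M_tiles * N_tiles * (K_tiles * per_k + per_output)
```

For a 2 x 2 x 4 tiling this gives 72 stages unpinned, 56 with `a_pinned=True`, and 40 with both operands pinned, so a sweep can attribute a drop in record count directly to pinning.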
|
||||
|
||||
### D4. Epilogue scope — `k_tile` vs `output_tile`
|
||||
|
||||
`epilogue_specs` is an iterable of op-spec objects. Each op object is
|
||||
expected to have:
|
||||
|
||||
- `op.kind: str` — math op name (e.g., `"dequant"`, `"bias"`, `"relu"`,
|
||||
`"scale"`). Placed into the stage's `params["op_kind"]`.
|
||||
- `op.scope: Scope` — `Scope.K_TILE` or `Scope.OUTPUT_TILE` (`Scope` enum
|
||||
in `kernbench.common.pe_commands`).
|
||||
- Op-specific extras (e.g., `bias`, `scale`, `factor`) — currently not used
|
||||
by the plan generator; consumed at runtime by PE_MATH.
|
||||
|
||||
The plan generator partitions by `getattr(o, "scope", None)`:
|
||||
|
||||
- `scope == Scope.K_TILE`: adds a MATH stage right after GEMM on every K-tile.
|
||||
- `scope == Scope.OUTPUT_TILE`: adds MATH stages just before STORE on the
|
||||
last K-tile per `(m, n)`.
|
||||
|
||||
Ops with neither `scope` value (e.g., missing attribute) are **dropped
|
||||
silently** — `getattr(..., None) == Scope.X` is False for both. Picking a
|
||||
default (`output_tile`) is the **caller's responsibility** (e.g.,
|
||||
`tl.composite`), not the plan generator's. This aligns with ADR-0014's
|
||||
composite epilogue contract.
|
||||
|
||||
`Scope` is imported lazily inside the function to avoid the circular path
|
||||
`pe_commands ← pe_types ← tiling`. This is intentional and not a refactor
|
||||
target: keeping `tiling` free of import-time `pe_commands` dependencies
|
||||
preserves the module boundary (D1).
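
The partitioning rule, including the silent drop, can be sketched as below. This is a minimal sketch: the local `Scope` enum stands in for the one in `kernbench.common.pe_commands`, its member values are assumed, and `partition_epilogue` is a hypothetical name.

```python
from enum import Enum
from types import SimpleNamespace


class Scope(Enum):
    # Stand-in for kernbench.common.pe_commands.Scope; values are assumed.
    K_TILE = "k_tile"
    OUTPUT_TILE = "output_tile"


def partition_epilogue(epilogue_specs):
    """Split op specs by scope; ops with no recognized scope are dropped."""
    specs = list(epilogue_specs or ())
    k_tile_ops = [o for o in specs if getattr(o, "scope", None) == Scope.K_TILE]
    output_ops = [o for o in specs
                  if getattr(o, "scope", None) == Scope.OUTPUT_TILE]
    return k_tile_ops, output_ops


specs = [
    SimpleNamespace(kind="dequant", scope=Scope.K_TILE),
    SimpleNamespace(kind="bias", scope=Scope.OUTPUT_TILE),
    SimpleNamespace(kind="relu"),  # no scope attribute: silently dropped
]
k_ops, out_ops = partition_epilogue(specs)
# Each surviving op becomes a MATH stage carrying params["op_kind"].
math_params = [{"op_kind": o.kind} for o in k_ops + out_ops]
```

The `relu` spec disappears without an error, which is why the caller is expected to assign a default scope before calling the generator.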
|
||||
|
||||
### D5. Math plan stage sequence — `M → N` order
|
||||
|
||||
For each `(m, n)` tile:
|
||||
|
||||
```
|
||||
DMA_READ → FETCH → MATH → STORE → DMA_WRITE
|
||||
```
|
||||
|
||||
There is no K dimension, so concepts like epilogue or accumulator residency
|
||||
do not apply. PE_FETCH_STORE's register-file accounting follows the same
|
||||
pattern as the GEMM plan.
|
||||
|
||||
### D6. Plans are data — no SimPy dependency
|
||||
|
||||
`PipelinePlan` is a dataclass in `pe_types.py` holding `tiles:
|
||||
list[TilePlan]`. Each `TilePlan` holds `stages: tuple[Stage, ...]`. The plan
|
||||
itself is near-immutable (only `Stage.params: dict` is mutable) and holds no
|
||||
SimPy objects.
|
||||
|
||||
At runtime, PE_SCHEDULER consumes the plan's first stage, builds a `TileToken`,
|
||||
and feeds it into the pipeline. The TileToken carries `plan: TilePlan`,
|
||||
`stage_idx: int`, and a cached `params: dict`. Self-routing proceeds by
|
||||
`TileToken.advance()` caching the next stage's `params` (ADR-0014 D6).
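
The token mechanics described above can be sketched with plain dataclasses. This is a sketch under assumptions: the `component` field on `Stage`, the `start` constructor, and the boolean return of `advance()` are illustrative choices, not the definitions in `pe_types.py`.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    component: str                      # assumed field: target component ID
    params: dict = field(default_factory=dict)


@dataclass
class TilePlan:
    stages: tuple


@dataclass
class TileToken:
    plan: TilePlan
    stage_idx: int = 0
    params: dict = field(default_factory=dict)

    @classmethod
    def start(cls, plan: TilePlan) -> "TileToken":
        """Build a token positioned on the plan's first stage."""
        return cls(plan=plan, stage_idx=0, params=plan.stages[0].params)

    def advance(self) -> bool:
        """Move to the next stage and cache its params; False when done."""
        nxt = self.stage_idx + 1
        if nxt >= len(self.plan.stages):
            return False
        self.stage_idx = nxt
        self.params = self.plan.stages[nxt].params
        return True


tile = TilePlan(stages=(Stage("pe0.pe_dma", {"op": "read"}),
                        Stage("pe0.pe_math", {"op_kind": "relu"})))
token = TileToken.start(tile)
```

No SimPy object appears anywhere in these types, which is the point of D6: the plan and the token are plain data that a component reads to find its next hop.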
|
||||
|
||||
### D7. Plan generator contract — pure, deterministic, idempotent
|
||||
|
||||
Two calls with identical inputs return equal `PipelinePlan` values
|
||||
(including `TilePlan.stages` order). This contract aligns with ADR-0014 D6's
|
||||
"deterministic tile dispatch order".
|
||||
|
||||
No side effects (no SimPy events, no file I/O, no global state) — tests can
|
||||
call the generators directly without an environment object (some cases in
|
||||
`tests/test_pe_pipeline.py` rely on this).
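
An environment-free determinism check might look like the following. This is a sketch under assumptions: `toy_math_plan` is a simplified stand-in for `generate_math_plan` that follows the D5 stage sequence, and the dataclass fields are illustrative rather than the definitions in `pe_types.py`.

```python
from dataclasses import dataclass, field
from math import ceil


@dataclass
class Stage:
    kind: str
    params: dict = field(default_factory=dict)


@dataclass
class TilePlan:
    stages: tuple


@dataclass
class PipelinePlan:
    tiles: list


MATH_SEQUENCE = ("DMA_READ", "FETCH", "MATH", "STORE", "DMA_WRITE")


def toy_math_plan(M, N, tile_m, tile_n, math_op):
    """Stand-in generator: one five-stage TilePlan per (m, n), M then N."""
    m_tiles = max(1, ceil(M / tile_m))
    n_tiles = max(1, ceil(N / tile_n))
    tiles = []
    for m in range(m_tiles):
        for n in range(n_tiles):
            stages = tuple(
                Stage(kind, {"m": m, "n": n, "op_kind": math_op})
                for kind in MATH_SEQUENCE
            )
            tiles.append(TilePlan(stages))
    return PipelinePlan(tiles)


# No SimPy environment is needed to build or compare plans.
first = toy_math_plan(64, 64, 32, 32, "relu")
second = toy_math_plan(64, 64, 32, 32, "relu")
```

The two results compare equal field by field while remaining distinct objects, which is the practical meaning of the determinism contract for a test.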
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Make tiling a component (e.g., PE_PLANNER)
|
||||
|
||||
Rejected. Plan generation consumes no SimPy time — it is a pure decision
|
||||
algorithm. Making it a component would (a) add unnecessary infrastructure
|
||||
(inbox, resources), and (b) split PE_SCHEDULER's flow into "receive plan"
|
||||
plus "feed tiles", inserting a meaningless hop.
|
||||
|
||||
### A2. Move plan generation into PE_SCHEDULER as methods
|
||||
|
||||
Rejected (currently). Module separation provides (1) testability and
|
||||
(2) extensibility: an additional plan algorithm (e.g., DTensor-aware) only
|
||||
needs a new function. If plan kinds proliferate enough to require explicit
|
||||
dispatch, a future ADR can introduce a plan factory on PE_SCHEDULER.
|
||||
|
||||
### A3. Make plans fully immutable (frozen dataclass + tuple)
|
||||
|
||||
Partially adopted. `Stage` and `TilePlan` are dataclasses but not frozen,
|
||||
because `Stage.params: dict` is populated at plan-generation time and read
|
||||
at runtime (cached by TileToken on advance). Moving dict → frozendict pays
|
||||
migration cost without enough benefit. Convention: do not mutate after
|
||||
generation.
|
||||
|
||||
## Consequences
|
||||
|
||||
- `tiling.py` is documented as a plan-generator module, not a component —
|
||||
preempting future G4-style "this component lacks an ADR" analyses.
|
||||
- The GEMM plan's stage sequence (D2) and pinning / epilogue branching
|
||||
(D3 / D4) are pinned, providing a clear interpretation basis for sweep
|
||||
analyses (e.g., `scripts/gemm_sweep.py`'s stage record counts).
|
||||
- The plan generator's pure contract (D7) enables environment-free testing
|
||||
in line with ADR-0013 (verification strategy).
|
||||
- Future plan kinds (DTensor-aware, K-major, ...) follow D1 / D6 / D7 as a
|
||||
baseline — just add a new function.
|
||||