ADR: add 0038-0042 (pcie_ep, pe_mmu, pe_tcm, sram, tiling)

Fill component-model coverage gaps surfaced by /report's G4 analysis.
Each ADR documents the component's First action, latency model, and
honest notes on dormant code or implementation asymmetries discovered
during re-evaluation against current code.

- 0038 pcie_ep: thin protocol-overhead model; ComponentBase forwarding
  worker as-is; named-node contract for router helpers
- 0039 pe_mmu: component + utility dual role; sub-page region stopgap;
  D2.1 flags pipeline path missing mmu.overhead_ns timeout (asymmetric
  with non-pipeline; not visible at default tlb_overhead_ns=0)
- 0040 pe_tcm: dual-channel BW serialization (read/write Resource cap=1);
  TcmRequest schema owned by TCM; timing-only (no data store)
- 0041 sram: terminal scratchpad model + ResponseMsg on reverse path;
  D1.1 flags _worker override as currently dormant (no Transaction
  actually targets the SRAM node today)
- 0042 tiling: pure plan-generator module, not a component; corrects
  the G4 misclassification; pins GEMM/Math stage sequences and
  epilogue scope contract

Also: /report skill G3 refinement — only flag older->newer asymmetric
cross-references; newer->older (e.g., 0034-0037 citing infrastructure
ADRs) are expected one-way and no longer reported.

Bilingual pair verifier (tools/verify_adr_lang_pairs.py) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:43:03 -07:00
parent 049e3d8bb3
commit 1f36baa898
11 changed files with 1747 additions and 3 deletions
@@ -0,0 +1,139 @@
# ADR-0038: PCIE_EP Component Model
## Status
Accepted (2026-05-20).
Companion to ADR-0035 (M_CPU), ADR-0036 (IO_CPU), and
ADR-0037 (Forwarding) at the same component-model level.
## First action
Pull one Transaction from `_inbox` and let `_forward_txn` invoke `run()`, which
applies a single `env.timeout(node.attrs["overhead_ns"])` for PCIe protocol
handling. After that the standard `ComponentBase` worker rules take over: if
`next_hop` exists, put the advanced Transaction on `out_ports[next_hop]`;
otherwise consume `drain_ns` and call `txn.done.succeed()`.
In other words, **PCIE_EP's first (and only) act is to spend the configured
overhead as simulator time** — no routing decisions, no payload transformation,
no MMIO decoding.
## Context
PCIE_EP is the **host ↔ device boundary** in the topology graph. The builder
(`topology/builder.py`) creates an IO chiplet instance per SIP that contains
`pcie_ep`, `io_cpu`, and `io_noc`, and lays bidirectional edges between the
external `fabric.switch0` and each `pcie_ep`:
- `switch → pcie_ep`: host → device traffic (MemoryWrite, MemoryRead,
KernelLaunch).
- `pcie_ep → switch`: device-side outbound (e.g., cross-SIP IPCQ tokens).
Inside the IO chiplet there are bidirectional `pcie_ep ↔ io_noc` edges, and
from there traffic branches to `io_cpu` or to the cube-side `hbm_ctrl` path
(see ADR-0036 IO_CPU model). The router and resolver already know — per SPEC
R7 — that PCIE_EP is the endpoint for memory operations, so helpers like
`find_pcie_ep(sip)` and `find_memory_path(pcie_ep, dst_node)` treat PCIE_EP as
the start (or end) of the memory path.
The problem is that all of this dependency lives in builder/router/resolver,
while **PCIE_EP's own internal model has no ADR**. The consequence:
- "What latency does PCIE_EP model?" requires reading the source.
- The asymmetry with peer components (IO_CPU = ADR-0036, M_CPU = ADR-0035) is
awkward.
- Future decisions about a more detailed PCIe link-layer model (TLP credits,
retry, MPS chunking) lack a documented baseline.
This ADR pins down the current **thin PCIE_EP model** and records that this
thinness is intentional (aligned with ADR-0033's latency-model simplification
policy).
## Decision
### D1. PCIE_EP uses ComponentBase's generic forwarding worker as-is
`PcieEpComponent` extends `ComponentBase` and does **not** override `_worker` or
`_forward_txn`. Every Transaction flows through the standard sequence:
1. `_fan_in` accumulates inbound messages (and reassembles Flits, per ADR-0033
Phase 2c) into `_inbox`.
2. `_worker` pulls one message off `_inbox` and spawns
`env.process(self._forward_txn(env, txn))` for per-message pipelining.
3. `_forward_txn` calls the op_log start hook → `run()` for latency → op_log
end hook.
4. `run()` is a single line: `yield env.timeout(overhead_ns)`.
5. If a next hop exists, `out_ports[next_hop].put(txn.advance())`. Otherwise
(terminal arrival) consume `drain_ns` and call `txn.done.succeed()`.
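Steps 2 to 5 can be sketched without SimPy. This is an illustrative model only: `Txn` and `forward_once` are hypothetical stand-ins for `Transaction` and the tail of `_forward_txn`, and the sketch assumes that `step` indexes the node currently holding the Transaction, which this ADR does not state explicitly.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Txn:
    """Hypothetical stand-in for Transaction; only routing fields are modeled."""
    path: tuple[str, ...]
    step: int = 0  # assumed: index of the node currently holding the txn

    @property
    def next_hop(self) -> Optional[str]:
        nxt = self.step + 1
        return self.path[nxt] if nxt < len(self.path) else None

    def advance(self) -> "Txn":
        # Return a copy positioned at the next node on the path.
        return replace(self, step=self.step + 1)


def forward_once(txn: Txn) -> Optional[tuple[str, Txn]]:
    """Return (out_port name, advanced txn), or None on terminal arrival."""
    hop = txn.next_hop
    if hop is None:
        # The real worker would consume drain_ns and call txn.done.succeed().
        return None
    return hop, txn.advance()
```

Because the component only follows `next_hop`, the same function serves inbound and outbound traffic, which is the direction-blind behavior described in D3.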
### D2. The only timing parameter is `overhead_ns`
Only `node.attrs["overhead_ns"]` is accepted as a latency parameter. The code
default is `0.0`; `topology.yaml`'s IOChiplet `components.pcie_ep.attrs`
supplies the real value (current topology: `overhead_ns: 5.0`, i.e. 5 ns).
No separate BW-serialization resource (`simpy.Resource`), no queue depth, no
retry model is introduced. Link-level BW serialization is handled wire-side —
inside the IOChiplet by `pcie_ep_to_noc_bw_gbs = 256.0 GB/s`, and externally by
the system's `io_ep_to_switch` link BW (ADR-0015 port/wire model). PCIE_EP
itself takes no part in that accounting.
### D3. PCIE_EP is direction-aware in topology but direction-blind in code
The builder lays both `switch ↔ pcie_ep` and `pcie_ep ↔ io_noc` edges, so
PCIE_EP serves:
- inbound (host → device): forward Transactions arriving from the switch onto
io_noc-side next-hop.
- outbound (device → host): forward Transactions arriving from io_noc/io_cpu
back to the switch.
Both are handled by D1's generic forwarding worker; the component code never
distinguishes direction (it just follows `txn.next_hop`).
### D4. PCIE_EP is not Flit-aware (legacy reassembly path)
`_FLIT_AWARE` is left at the inherited `False`, so `_fan_in` reassembles
upstream-chunkified Flits into the parent Transaction before delivery to
`_inbox` (aligned with ADR-0033 Phase 2c incremental rollout).
A future PCIe TLP-level credit model would revisit D4.
### D5. PCIE_EP is a **named node** for routing helpers
`policy/routing/router.py` provides `find_pcie_ep(sip, io_id="io0")`,
`find_all_pcie_eps()`, and `find_memory_path(pcie_ep, dst_node)` — all of
which treat PCIE_EP as the start (or end) of the memory path. The component
itself supplies no information to these helpers; the naming convention
(`sip{S}.{io_id}.pcie_ep`) is guaranteed by the topology builder.
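The naming convention can be illustrated with a small sketch. The helper names `pcie_ep_node_id` and `parse_pcie_ep_node_id` are hypothetical and are not the router's `find_pcie_ep`; only the `sip{S}.{io_id}.pcie_ep` pattern and the `io_id="io0"` default come from this ADR.

```python
import re

# Hypothetical helpers illustrating the `sip{S}.{io_id}.pcie_ep` convention.
# They are not part of policy/routing/router.py.
_PCIE_EP_RE = re.compile(r"^sip(\d+)\.([^.]+)\.pcie_ep$")


def pcie_ep_node_id(sip: int, io_id: str = "io0") -> str:
    """Format the node ID that the topology builder is said to guarantee."""
    return f"sip{sip}.{io_id}.pcie_ep"


def parse_pcie_ep_node_id(node_id: str) -> tuple[int, str]:
    """Split a PCIE_EP node ID into (sip index, io_id); raise on mismatch."""
    match = _PCIE_EP_RE.match(node_id)
    if match is None:
        raise ValueError(f"not a PCIE_EP node ID: {node_id!r}")
    return int(match.group(1)), match.group(2)
```

Any rename of component IDs would have to keep both directions of this mapping consistent, which is the blast radius D5 refers to.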
## Alternatives Considered
### A1. Full PCIe TLP-level model (credits, retry, MPS chunking)
Rejected. Violates ADR-0033's "current latency model = abstract overhead + BW
serialization" simplification. Host↔device protocol fidelity is explicitly
out-of-scope in SPEC §5 "Non-Goals".
### A2. Per-PCIE_EP `simpy.Resource` for in-flight cap
Rejected. Host traffic is not a contention bottleneck in current workloads.
Defer to a separate ADR if it becomes one (in which case D1 stays and D2 is
extended).
### A3. Merge PCIE_EP into IO_CPU
Rejected. PCIE_EP is the protocol-boundary node first hit on the host side;
IO_CPU is the device-side control-plane processing node (ADR-0036). Traffic
fan-out and command decoding costs concentrate in IO_CPU, while PCIE_EP only
expresses link-edge overhead. Merging them would mix two responsibilities and
violate the spirit of ADR-0007 (runtime API/sim_engine boundaries).
## Consequences
- PCIE_EP gets an explicit model ADR despite having near-zero code — consistent
with peer component ADRs, lower maintenance friction.
- A future PCIe-level refinement would supersede this ADR by extending D2/D4 in a new ADR.
- D5 makes the named-node dependency explicit, so any future renaming of
component IDs has a clearly bounded blast radius.
@@ -0,0 +1,203 @@
# ADR-0039: PE_MMU Component Model — Component + Utility Dual Role
## Status
Accepted (2026-05-20).
ADR-0011 (PA/VA/LA address model) only states that "the VA model translates
VA→PA via PE_MMU"; this ADR pins down **the PE_MMU component's own behavior
model**.
## First action
At construction, read `node.attrs["page_size"]` (default `2 MiB`) and
`node.attrs["tlb_overhead_ns"]` (default `0.0`) and instantiate the internal
`PeMMU` utility object (`policy.address.pe_mmu.PeMMU`) exactly once. That
object is the single owner of the page table, the sub-page region lists, and
the TLB overhead value.
At runtime the first action splits into two paths:
- **Component path (inbox consumption)**: `_worker` pulls a Transaction off
`_inbox`; if `request` is a `MmuMapMsg`, call `self._mmu.map(va, pa, size)`
for each entry and then `txn.done.succeed()`. For `MmuUnmapMsg`, call
`unmap(va, size)`. Any other type falls through to standard `_forward_txn`.
In other words, **the component's first act is "apply map/unmap commands to
the page table"**.
- **Utility path (direct call)**: a sibling PE engine (PE_DMA / PE_GEMM) calls
`pe_mmu.mmu.translate(va)` directly. This path produces no SimPy events;
the caller (when `overhead_ns > 0`) issues a `yield env.timeout(mmu.overhead_ns)`
in its own process.
## Context
ADR-0011 defined three address models (PA/VA/LA) and agreed that "VA model =
translation via PE_MMU". But in code, `PeMmuComponent` performs two
complementary roles simultaneously:
1. **A topology-graph component**: it receives `MmuMapMsg` / `MmuUnmapMsg`
sideband messages over the cube NoC and updates the page table.
2. **A PE-local utility**: PE_DMA / PE_GEMM on the same PE call
`translate(va)` directly with zero SimPy latency (the caller pays
`overhead_ns` if any).
Without an ADR covering both roles, the following questions are ambiguous:
- "Why isn't there a SimPy event for the MMU translate?" (Answer: the caller
pays it.)
- What is the sub-page region model, and why? (The code docstring has it, but
no ADR — only a memory note `project_mmu_subpage_stopgap`.)
- Who sends map/unmap, and when must they be visible? (Ordering contract.)
Additionally, `PeMMU.map()` has "append, last-write-wins on overlap"
semantics and keeps several regions per page, which a one-PA-per-entry page
table cannot express. That is a deliberate **simulator stopgap**: it lets
DPPolicy shard below page granularity (e.g., 128 B payloads against 4 KiB
pages) without a later mapping silently replacing the whole page entry and
misrouting the earlier shards. This deviation from real HW MMU semantics must
be ADR-pinned.
## Decision
### D1. Explicit dual role — component and utility
`PeMmuComponent` exposes two interfaces from a single class:
- Component interface: `_inbox` consumption, `_worker` loop (handles MMU
sideband messages).
- Utility interface: the `mmu` property exposes the underlying `PeMMU` object,
which PE_DMA / PE_GEMM hold directly and invoke `translate()` on.
The latter is **not a layer skip**: inside a PE, the engines and PE_MMU are
siblings under the "components" layer (ADR-0007). Cross-layer violations only
apply to runtime API ↔ sim_engine ↔ components boundaries.
### D2. Latency model — `translate()` is pure; caller owns the timeout
`PeMMU.translate()` is a pure function and yields nothing in SimPy. The caller
(a PE engine) issues `if mmu.overhead_ns > 0: yield env.timeout(mmu.overhead_ns)`
in its own process after translation.
Rationale: the PE engine process already holds its own `record_start` /
`record_end` (op_log) hooks, so keeping timing inside the caller's process
preserves consistent timing accounting. A separate MMU process would split the
engine's processing flow and blur op_log / pipeline overlap semantics.
#### D2.1. Current implementation asymmetry — pipeline vs non-pipeline (known)
At the time of writing, `pe_dma.py` handles MMU overhead differently in its
two call paths:
- **non-pipeline (`handle_command`)**: after `translate()`, applies
`if self._mmu.overhead_ns > 0: yield env.timeout(self._mmu.overhead_ns)`.
- **pipeline (`_do_pipeline_dma`)**: calls `translate()` only, **omitting**
the overhead timeout — though the comment says "same logic as non-pipeline
path", the behaviors differ.
In the default topology, `tlb_overhead_ns = 0.0`, so this asymmetry does not
manifest. With `tlb_overhead_ns > 0`, however, GEMM/Math work on the pipeline
path appears faster than the equivalent non-pipeline workload by the omitted
MMU overhead.
The D2 contract states that **all** callers pay the overhead; the pipeline
omission is **not an intentional design** — ADR-0014 D6 (pipeline self-routing)
does not exempt it. Remediation options (require a separate Phase 1/2):
- (a) Add `if mmu.overhead_ns > 0: yield env.timeout(...)` in
`_do_pipeline_dma` to align with D2 — **preferred**.
- (b) Narrow the D2 contract to "non-pipeline only" and document the pipeline
exemption in an ADR-0014 update — discouraged, since it weakens the
overhead's meaning.
This ADR recommends (a) and assumes a small follow-up change either before or
just after acceptance.
### D3. Page table structure — sub-page region list (stopgap)
`self._table: dict[vpn, list[(start_in_page, end_in_page, pa_at_offset_zero)]]`
holds multiple disjoint regions per page.
- `map(va, pa, size)`: append a region to each page the range touches, splitting the range where it crosses a page boundary.
- `translate(va)`: look up regions for the VPN and iterate **in reverse** so
the most recent overlapping region wins (last-write-wins).
- `unmap(va, size)`: remove only regions whose extent is **fully contained**
within the unmap range; partial-overlap boundaries are left in place and the
caller is expected to unmap on the same boundaries used for map.
This is documented as a **simulator stopgap** that supplements the VA model
from ADR-0011. It prevents silent last-write-wins misrouting when DPPolicy
shards below page granularity. Memory note: `project_mmu_subpage_stopgap`.
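The three operations above can be sketched in plain Python. This is a minimal illustration under assumptions, not the `PeMMU` implementation: the class name `SubPageMMU` is hypothetical, and the reading of `pa_at_offset_zero` as "the PA that page offset 0 would map to" is inferred from the field name.

```python
class PageFault(Exception):
    """Raised when no region covers the requested VA."""


class SubPageMMU:
    """Illustrative stand-in for PeMMU; not the real implementation."""

    def __init__(self, page_size: int = 2 * 1024 * 1024) -> None:
        self.page_size = page_size
        # vpn -> list of (start_in_page, end_in_page, pa_at_offset_zero)
        self._table: dict[int, list[tuple[int, int, int]]] = {}

    def map(self, va: int, pa: int, size: int) -> None:
        end = va + size
        while va < end:
            vpn, start = divmod(va, self.page_size)
            chunk = min(end - va, self.page_size - start)  # stop at page boundary
            # pa - start is the PA that page offset 0 would map to (assumed meaning).
            self._table.setdefault(vpn, []).append((start, start + chunk, pa - start))
            va += chunk
            pa += chunk

    def translate(self, va: int) -> int:
        vpn, off = divmod(va, self.page_size)
        # Reverse iteration: the most recently appended region wins.
        for start, end, pa0 in reversed(self._table.get(vpn, [])):
            if start <= off < end:
                return pa0 + off
        raise PageFault(hex(va))

    def unmap(self, va: int, size: int) -> None:
        end = va + size
        for vpn in range(va // self.page_size, (end - 1) // self.page_size + 1):
            base = vpn * self.page_size
            lo = max(va, base) - base
            hi = min(end, base + self.page_size) - base
            # Drop only regions fully contained in the unmapped range.
            kept = [r for r in self._table.get(vpn, []) if not (lo <= r[0] and r[1] <= hi)]
            if kept:
                self._table[vpn] = kept
            else:
                self._table.pop(vpn, None)
```

With a 4 KiB page, two 128 B shards mapped into the same page keep separate regions, so each translates to its own PA instead of the second mapping replacing the first.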
### D4. PageFault signals PA fallback
If `translate()` is called with an unmapped VA, `PageFault` is raised. PE_DMA
catches the exception and **uses the original address as a PA** (the PA-only
backward-compatibility path from ADR-0011). PageFault is therefore not an
error — it is the signal for "no VA mapping, interpret as PA".
This path is intentional and preserves backward compatibility with the
ADR-0011 PA-only mode.
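A caller-side sketch of this fallback is shown here. The function name `resolve_address` and the local `PageFault` class are hypothetical stand-ins; the real catch site is inside PE_DMA and may be structured differently.

```python
from typing import Callable


class PageFault(Exception):
    """Stand-in for the exception raised on an unmapped VA."""


def resolve_address(translate: Callable[[int], int], addr: int) -> int:
    """Return the translated PA, or `addr` itself when no mapping exists."""
    try:
        return translate(addr)
    except PageFault:
        # Per D4 this is not an error: the address is interpreted as a PA.
        return addr
```

Only `PageFault` is caught, so any other exception raised by `translate` still propagates as a genuine error.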
### D5. MMU sideband-message reception contract
`MmuMapMsg` / `MmuUnmapMsg` arrive over the fabric at PE_MMU's `_inbox`
(SPEC R10: "MMU map installation incurs measured fabric latency"). Schemas
live in `runtime_api/kernel.py`:
- `MmuMapMsg.entries: tuple[dict, ...]` — each dict is
`{"va": int, "pa": int, "size": int}`.
- `MmuUnmapMsg.entries: tuple[dict, ...]` — each dict is
`{"va": int, "size": int}`.
PE_MMU reception flow:
1. `_worker` does `_inbox.get()` for one message.
2. `hasattr(msg, "request")` confirms a Transaction wrapper.
3. `isinstance(msg.request, MmuMapMsg)` → for each entry, call
`self._mmu.map(va=e["va"], pa=e["pa"], size=e["size"])`.
4. `isinstance(msg.request, MmuUnmapMsg)` → for each entry, call
`self._mmu.unmap(va=e["va"], size=e["size"])`.
5. Both signal `msg.done.succeed()` after completion.
An external caller (runtime API) waiting on `done` therefore receives a SimPy
guarantee that "the mapping is installed on-device" — this is the realization
of ADR-0011's "MMU map installation incurs measured fabric latency".
This ADR does **not** define the **sender or fan-out policy** for the sideband
message — those are runtime API responsibilities. Only the receive contract
belongs here.
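The reception flow in D5 can be sketched as a plain dispatch function. Everything here is a stand-in under assumptions: the two message dataclasses only mirror the `entries` field described above (the real schemas live in `runtime_api/kernel.py`), and `Txn`, `RecordingMMU`, and `apply_sideband` are hypothetical names. Signalling `done` is left to the caller of the function.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class MmuMapMsg:      # stand-in: entries of {"va", "pa", "size"}
    entries: tuple[dict, ...]


@dataclass(frozen=True)
class MmuUnmapMsg:    # stand-in: entries of {"va", "size"}
    entries: tuple[dict, ...]


@dataclass
class Txn:            # hypothetical Transaction wrapper
    request: Any


class RecordingMMU:
    """Hypothetical MMU stub that records calls instead of editing a page table."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    def map(self, va: int, pa: int, size: int) -> None:
        self.calls.append(("map", va, pa, size))

    def unmap(self, va: int, size: int) -> None:
        self.calls.append(("unmap", va, size))


def apply_sideband(mmu: Any, msg: Any) -> bool:
    """Apply a map/unmap request; return False when msg is not MMU sideband."""
    request = getattr(msg, "request", None)  # step 2: Transaction wrapper check
    if isinstance(request, MmuMapMsg):
        for e in request.entries:
            mmu.map(va=e["va"], pa=e["pa"], size=e["size"])
        return True
    if isinstance(request, MmuUnmapMsg):
        for e in request.entries:
            mmu.unmap(va=e["va"], size=e["size"])
        return True
    return False  # the worker would fall through to generic forwarding
```

A `False` return corresponds to the case where the message is handed to `_forward_txn` instead.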
### D6. Non-MMU Transactions delegate to generic forwarding
If a message pulled from `_inbox` is not `MmuMapMsg` / `MmuUnmapMsg` (or
lacks a `request` attribute), `_forward_txn` handles it normally. This keeps
the door open for future topologies where PE_MMU sits on a pass-through path —
current code never sends such traffic, but the routing remains safe.
## Alternatives Considered
### A1. Make `translate()` a SimPy generator
Rejected. As D2 explains, this blurs op_log / pipeline overlap accounting in
the PE engine.
### A2. Use small page size (e.g., 128 B) instead of sub-page regions
Rejected. Would explode page-table memory and cube-wide map message size. Most
mappings are 2 MiB; pushing the page size below that for the few DPPolicy
sharding cases inflates average cost.
### A3. Make PE_MMU a PE_CPU helper only (not a topology node)
Rejected. ADR-0011 requires that MMU map installation incur measured fabric
latency (via `MmuMapMsg`), which requires PE_MMU to be a node on the graph.
It also keeps cube NoC visualizer output consistent.
## Consequences
- PE_MMU's dual role is justified at ADR level, so future "unify into one"
refactor pressure has a documented counterpoint.
- The sub-page region model is explicitly labeled a stopgap, providing a
basis for deprecating it when the LA model (ADR-0011) lands.
- The "`translate()` does not yield" contract is locked in (D2), so any
future proposal to add an internal MMU timeout can be denied with a
documented rationale.
- PA fallback (D4) is normalized, preventing defensive logic from treating
PageFault as an error.
@@ -0,0 +1,149 @@
# ADR-0040: PE_TCM Component Model — Dual-Channel BW Serialization
## Status
Accepted (2026-05-20).
ADR-0014 (PE Pipeline Execution Model, D1) references PE_TCM as a "BW-based
serialized scratchpad memory" but does not pin down the component's own model.
This ADR fills that gap.
## First action
When `start()` is invoked, immediately create two `simpy.Resource(env, capacity=1)`
instances and store them in `self._read_res` / `self._write_res`. These two
resources are the single decision points that serialize the **read channel**
and **write channel** to one in-flight request each.
The runtime first action: `_worker` pulls a message off `_inbox` and branches
by type:
- `TcmRequest` (from `pe_fetch_store`): spawn `env.process(self._handle_tcm_request)`.
Hence **TCM's first act is "acquire the lock matching the direction
(read/write)"**. After lock acquisition, if `bw > 0 and nbytes > 0`, yield
`env.timeout(nbytes / bw)` (the delay in ns), then call `req.done.succeed()`.
- Anything else (Transaction): spawn `env.process(self._forward_txn)` (legacy
fabric pass-through).
At construction, `node.attrs["read_bw_gbs"]` and `node.attrs["write_bw_gbs"]`
(default `512.0 GB/s` each) are captured and held.
## Context
In the PE pipeline (ADR-0014 D1, D6), PE_TCM receives two kinds of traffic:
1. **`TcmRequest` from PE_FETCH_STORE** — when moving data between TCM and
the register file, PE_FETCH_STORE sends a short sideband request to obtain
BW-serialized access latency (`direction = "read"` or `"write"`, `nbytes`,
`done` event).
2. **Legacy Transaction forwarding** — a fallback in case TCM ends up as a
pass-through node on the fabric graph (not used by the current critical
path, but preserved).
The problem: ADR-0014 only says "BW-based serialization" without specifying:
- Read and write are **independent channels** running in parallel; only
same-direction concurrency serializes at `capacity=1`.
- BW is split into two configurable values (`read_bw_gbs` / `write_bw_gbs`).
- The formula is `delay_ns = nbytes / bw_gbs` (loose unit convention:
GB/s × ns ≈ B).
- `nbytes == 0` still acquires the lock but skips the BW term.
- `run()`'s `overhead_ns` (default `0.0`) is only used in the legacy fabric
forwarding path.
Each of these requires an ADR. In particular, "why are read and write
separate channels" and "who owns the BW values" must be documented so that
future changes (e.g., `capacity=2`) have a clear basis.
## Decision
### D1. Dual channel — read and write are independent resources
`_read_res = simpy.Resource(env, capacity=1)`,
`_write_res = simpy.Resource(env, capacity=1)`.
Same-direction concurrent requests queue on the resource and serialize;
opposite-direction requests proceed in parallel. This matches the hardware
model where TCM has a dual-port (read + write) configuration, and it allows
the simulator to express the GEMM-pipeline case where fetch (read) and store
(write) overlap in time — modeled as BW-serialized inside each direction but
independent across directions.
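The effect of two independent `capacity=1` channels can be checked with a small analytic timeline. This sketch deliberately avoids SimPy: it assumes FIFO service within each direction and requests listed in arrival order, and `tcm_completion_times` is a hypothetical helper, not code from `pe_tcm.py`.

```python
def tcm_completion_times(requests, read_bw_gbs=512.0, write_bw_gbs=512.0):
    """Analytic stand-in for two capacity=1 resources.

    `requests` is a list of (arrival_ns, direction, nbytes) sorted by arrival.
    Returns the completion time in ns of each request, in input order.
    """
    bw = {"read": read_bw_gbs, "write": write_bw_gbs}
    free_at = {"read": 0.0, "write": 0.0}  # when each channel next becomes idle
    done = []
    for arrival_ns, direction, nbytes in requests:
        # A request queues only behind earlier requests of the same direction.
        start = max(arrival_ns, free_at[direction])
        rate = bw[direction]
        delay = nbytes / rate if (rate > 0 and nbytes > 0) else 0.0
        free_at[direction] = start + delay
        done.append(free_at[direction])
    return done
```

For two 1024 B reads and one 1024 B write that all arrive at t = 0, the reads finish at 2 ns and 4 ns while the write finishes at 2 ns, so fetch and store overlap but same-direction requests serialize.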
### D2. Per-channel BW model — `nbytes / bw_gbs`
After lock acquisition, if `nbytes > 0 and bw > 0`, yield
`env.timeout(nbytes / bw_gbs)`. The unit convention is GB/s × ns ≈ B,
consistent with the simulator-wide loose convention (see ADR-0033).
- `nbytes == 0`: BW term is zero, but the lock is acquired and released. This
is intentional: when a plan generator emits an empty fetch/store on the
PE_FETCH_STORE side, the op_log / channel accounting on the TCM side still
records one consumption.
- `bw == 0` (config error): the timeout call is skipped (0-time pass). Should
not occur with normal settings.
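The BW term and its two edge cases reduce to a few lines. The function name `tcm_delay_ns` is hypothetical; the unit argument takes 1 GB as 10^9 bytes, so 1 GB/s is 1 byte per ns and `nbytes / bw_gbs` is already in ns.

```python
def tcm_delay_ns(nbytes: int, bw_gbs: float) -> float:
    """BW term of a TcmRequest in ns (1 GB/s is treated as 1 byte per ns)."""
    if nbytes > 0 and bw_gbs > 0:
        return nbytes / bw_gbs
    # Empty request or bw == 0: the timeout is skipped, the lock is still taken.
    return 0.0
```

For example, a 4096 B fetch at the default 512.0 GB/s costs 8 ns.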
### D3. BW values come from `node.attrs.read_bw_gbs` / `write_bw_gbs`
Defaults `512.0 GB/s`. The topology builder (`topology/builder.py`) passes
these attrs when instantiating TCM from `pe_template`. Default changes should
coincide with related decisions in ADR-0014 D1 or ADR-0033.
### D4. TcmRequest schema is owned by PE_TCM
`@dataclass TcmRequest(direction: str, nbytes: int, done: simpy.Event, tag: str = "")`
lives in `components/builtin/pe_tcm.py`. PE_FETCH_STORE imports the dataclass
and only constructs/sends it. The caller does not define the schema because:
- The meaning of BW serialization is TCM's responsibility — TCM decides which
fields drive serialization.
- The valid-value check for `direction` (must be `"read"` or `"write"`) lives
in `_handle_tcm_request`'s if/else branch.
### D5. Legacy Transaction forwarding path is preserved
When `_worker` receives a non-`TcmRequest` message, it dispatches to
`_forward_txn`, applying `run()`'s `overhead_ns`. The current standard PE
pipeline does not route Transactions through TCM, but the path is kept to
avoid breakage if fabric topology changes.
This path is accounted for via standard Transaction op_log; the BW channel
locks are **not** acquired (orthogonal to D1's usage).
### D6. PE_TCM is not a data store (timing only)
TCM models **time only**. The actual data payload is held by sim_engine's
`memory_store` (when present); the TCM component never updates it.
PE_FETCH_STORE obtains BW delay through `TcmRequest`, and register contents
are handled separately in the data path (ADR-0020 2-pass data execution —
Phase 2).
## Alternatives Considered
### A1. Single channel (`capacity=2` for shared read+write)
Rejected. Would artificially serialize the normal-case overlap of fetch
(read) and store (write) and yield an incorrect BW upper bound for the PE
pipeline.
### A2. `capacity > 1` (e.g., 2-banked TCM)
Rejected. Current hardware model assumes a single bank. Multi-bank extension
needs its own ADR that would supersede D1. Bumping capacity now would loosen
the nominal serialization without raising the BW upper bound, producing less
accurate modeling.
### A3. Generalize BW formula to `nbytes / bw + overhead_ns`
Rejected. `overhead_ns` is reserved for the legacy forwarding path (D5).
Additional fetch/store-path overhead, if needed, belongs in PE_FETCH_STORE's
`run()` or in a register-file access model — closer to the responsibility
boundary.
## Consequences
- TCM's BW accounting is locked at ADR level. Questions arising from op_log
in GEMM/Math sweeps — "why did fetch and store overlap?", "why do only
same-direction requests serialize?" — resolve quickly to D1.
- Future multi-bank TCM models or asymmetric read/write BW changes have a
clear blast radius (D1 / D2 / D3 — pick one).
- D6 ("TCM is not a data store") sharpens the responsibility boundary with
ADR-0020 2-pass execution.
@@ -0,0 +1,195 @@
# ADR-0041: Cube SRAM Component Model — terminal scratchpad on cube NoC
## Status
Accepted (2026-05-20).
ADR-0017 (Cube NOC and HBM Connectivity) describes SRAM as a cube-NoC
attachment but does not specify the SRAM component's own latency / response
model. This ADR fills that gap.
## First action
Inside `_worker`, immediately after pulling a Transaction off `_inbox`, the
very first action is `yield from self.run(env, txn.nbytes)`. Inside `run()`,
the component applies `env.timeout(node.attrs["overhead_ns"])`
(default `0.0`).
In short, **SRAM's first act is "express access overhead as simulator time"**.
After overhead, the worker yields `drain_ns` (the terminal BW-serialization
cost stamped on the Transaction) and then constructs and dispatches a
`ResponseMsg` on the reverse path.
This differs from a generic `ComponentBase._worker`: SRAM knows it is a
**terminal node**, so it does not go through `_forward_txn`. Its own worker
explicitly performs `run → drain → _send_response`.
## Context
The cube topology (`topology/builder.py`) creates the following named nodes
per cube:
- `sip{S}.cube{C}.m_cpu`
- `sip{S}.cube{C}.sram`
- `sip{S}.cube{C}.hbm_ctrl` (per-PE partitions)
- `sip{S}.cube{C}.pe{P}` (and its PE-internal sub-components)
SRAM is one of the cube-NoC attachments — `topology/mesh_gen.py` assigns it
to the nearest router by placement coordinates and adds `"sram"` to that
router's `attach` list. The builder lays bidirectional `sram ↔ router` edges
(BW: `sram_to_router_bw_gbs`, default `128.0 GB/s`).
SRAM has two intertwined roles:
1. **Fabric terminal**: the endpoint for cube-NoC memory-access Transactions
destined for SRAM. SRAM consumes access overhead + drain, then sends a
response back on the reverse path.
2. **One of the IPCQ slot tiers**: ADR-0023 D9.7 defines
`buffer_kind ∈ {tcm, sram, hbm}`; the `sram` tier's per-access cost is
`(512.0 GB/s, 2.0 ns)` in `common/ipcq_types._BUFFER_KIND_BW`. This is
separate from the SRAM node's `overhead_ns` attr; PE_DMA accounts for it
directly at the IPCQ slot-write moment.
Without an ADR covering both roles, the following questions are ambiguous:
- "What latency does SRAM model?" — fabric drain + overhead, or the IPCQ
tier slot latency? — answers scatter.
- What does the `size_mb` attr (`32`) mean? It is currently unused; SRAM only
models timing.
- Which cube router does SRAM attach to? (placement-based; lives in topology
code only.)
## Decision
### D1. SRAM is a terminal scratchpad node on the cube NoC
`SramComponent` extends `ComponentBase` but overrides `_worker` to express
terminal semantics directly:
```
while True:
    txn = yield self._inbox.get()
    yield from self.run(env, txn.nbytes)   # overhead_ns
    if drain_ns > 0:                       # drain_ns: stamped on the Transaction
        yield env.timeout(drain_ns)
    yield from self._send_response(env, txn)
```
This pattern is necessary because SRAM must know the reverse path; the
generic `_forward_txn` (which forwards to the next hop) does not fit a
terminal.
#### D1.1. Currently dormant — the `_worker` override is an unused path
At the time of writing, **no component actually sends a Transaction to the
SRAM node**. The verified references to the SRAM node ID are:
- `policy/routing/router.py` and friends — guarantee path lookups.
- `components/builtin/pe_dma.py::_handle_ipcq_inbound` — for
`buffer_kind == "sram"`, computes the *path* to
`bank_node = f"{cube_prefix}.sram"` via `compute_drain_ns(path, ...)` and
yields a **local** timeout. The Transaction itself does not flow to the
SRAM node (see D4).
- `tests/test_routing.py` — checks connectivity via
`find_path("sip0.cube0.pe0", "sip0.cube0.sram")`.
So the `_worker` / `_send_response` override is currently a **dormant code
path**. It is preserved deliberately:
- Topology changes that route fabric Transactions to SRAM terminally (e.g.,
explicit M_CPU → SRAM accesses) would activate it immediately.
- ADR-0017's "cube-attached scratchpad" semantics naturally implies terminal
behavior; the override is an intentional placeholder.
A future ADR (or a revision to this one) will mark dormancy resolved when an
actual sender is added.
### D2. ResponseMsg construction and reverse-path dispatch
`_send_response`:
1. `reverse_path = list(reversed(txn.path))` — derive the reverse path.
2. Construct `ResponseMsg(correlation_id=txn.request.correlation_id,
request_id=..., src_cube=<this cube>, src_pe=-1, success=True)`.
3. Wrap in `Transaction(request=resp_msg, path=reverse_path, step=0,
nbytes=0, done=env.event(), is_response=True)` and put on
`out_ports[reverse_path[1]]`.
4. If the reverse path is too short (`< 2 hops`) or `ctx` is absent, fall
back to calling the original `txn.done.succeed()`.
`src_pe = -1` means "SRAM is not PE-localized". `src_cube` is parsed from the
node ID (`sip{S}.cube{C}.sram`).
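The routing part of `_send_response` can be sketched without SimPy. `ResponseRoute` and `plan_response` are hypothetical names used only for illustration; they model steps 1, 3, and 4 plus the `src_cube` parse, and leave out the `ResponseMsg` fields and the `ctx` check.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResponseRoute:          # hypothetical container, not the real ResponseMsg
    reverse_path: list[str]
    first_hop: str            # key into out_ports, i.e. reverse_path[1]
    src_cube: int
    src_pe: int = -1          # SRAM is not PE-localized


def plan_response(path: list[str], node_id: str) -> Optional[ResponseRoute]:
    """Return the reverse route, or None when the txn.done fallback applies."""
    reverse_path = list(reversed(path))
    if len(reverse_path) < 2:
        return None  # too short to have a first hop: fall back to txn.done
    match = re.search(r"\.cube(\d+)\.sram$", node_id)
    if match is None:
        raise ValueError(f"not an SRAM node ID: {node_id!r}")
    return ResponseRoute(reverse_path, reverse_path[1], int(match.group(1)))
```

A `None` result corresponds to the fallback in step 4, where the original `txn.done` is succeeded directly.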
### D3. Timing parameters: `overhead_ns` and wire-side `drain_ns`
- **Component-side latency**: `node.attrs["overhead_ns"]`. Default topology
uses `2.0 ns`.
- **Link-side serialization**: `drain_ns` arrives stamped on the Transaction
— the wire-side BW serialization result from ADR-0015. SRAM only yields it.
- The `size_mb` (default `32 MiB`) attr is currently timing-neutral. If a
capacity-aware model is added in the future, a separate ADR will give it
meaning.
### D4. IPCQ slot accounting is not modeled by the SRAM component
Per ADR-0023 D9.7, the IPCQ slot-write latency for the SRAM tier is incurred
inside PE_DMA's `_handle_ipcq_inbound`, which calls
`slot_io_latency_ns("sram", nbytes)` using `_BUFFER_KIND_BW["sram"]`. That is:
- When SRAM receives a fabric Transaction (D1, D2, D3 apply), it processes
normally.
- When an IPCQ slot lives on SRAM, PE_DMA pays the slot-write time directly —
independent of the SRAM component.
This separation is intentional: IPCQ is a fast path (sub-cycle slot
bookkeeping) and does not traverse fabric Transactions, so SRAM does not need
to know about IPCQ.
### D5. SRAM's cube-NoC attachment is placement-driven
`topology/mesh_gen.py` reads `placement.sram.pos_mm` (default `[1.5, 9.0]` in
`topology.yaml`) and adds `"sram"` to the nearest router's `attach`. The
builder (`topology/builder.py`'s attachment loop) then lays bidirectional
edges between the `sram` node and that router.
This decision lives outside the SRAM component (mesh_gen / builder); the
component does not know which router it sits on. It only relies on
`txn.path` / `reverse_path` to reach it via a router.
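A nearest-router choice of this kind might look like the sketch below. The function name, the router IDs, and the use of Euclidean distance are all assumptions; this ADR only says that `mesh_gen.py` picks the nearest router by placement coordinates.

```python
import math


def nearest_router(pos_mm, routers):
    """Pick the router ID closest to `pos_mm`.

    `routers` maps router ID -> (x_mm, y_mm). Euclidean distance is an
    assumption; ties go to the first router in insertion order.
    """
    return min(routers, key=lambda rid: math.dist(pos_mm, routers[rid]))


# Usage with hypothetical router coordinates and the default SRAM position.
routers = {"r0": (0.0, 0.0), "r1": (2.0, 8.0), "r2": (6.0, 8.0)}
attach = {rid: [] for rid in routers}
attach[nearest_router((1.5, 9.0), routers)].append("sram")
```

Moving `placement.sram.pos_mm` changes only which `attach` list receives `"sram"`; the component code is unaffected.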
### D6. SRAM is not a data store (timing only)
Same context as ADR-0040 D6: the SRAM component models time only; the data
payload (if any) lives in sim_engine's `memory_store`.
## Alternatives Considered
### A1. Use `_forward_txn` and route responses via separate nodes (à la IO_CPU / HBM_CTRL)
Rejected. SRAM is a terminal on the cube NoC; adding a response node would
introduce meaningless hops and violate ADR-0017's simplification spirit.
### A2. Model BW serialization inside SRAM with its own resource
Rejected. Wire-side BW serialization (`drain_ns`) already captures it. An
internal `simpy.Resource` would double-count against ADR-0015 (port/wire
model).
### A3. Handle IPCQ slot accounting in the SRAM component
Rejected. As D4 makes explicit, IPCQ is a fast path that does not traverse
fabric Transactions. If SRAM knew about IPCQ, the responsibility would split
across two places and obscure reasoning.
### A4. Capacity-aware latency from `size_mb`
Rejected for now. The capacity is currently a visualizer label; introducing
a capacity-aware timing model requires a dedicated ADR.
## Consequences
- SRAM's timing model is pinned at ADR level as
`overhead_ns + drain_ns + ResponseMsg(reverse_path)`. Any proposal to push
IPCQ slot latency into the SRAM component can be refused with D4.
- D3 records that `size_mb` is timing-neutral today, so a future
capacity-aware model has a narrow compatibility scope.
- D5 documents the placement-driven attachment, so changes to the SRAM
coordinate have a clearly bounded impact (`mesh_gen` only).
@@ -0,0 +1,199 @@
# ADR-0042: Tile Plan Generators — GEMM/Math Pipeline Plan Builders
## Status
Accepted (2026-05-20).
This ADR pins down `tiling.py` as a **plan-generator
module**, not a SimPy component.
ADR-0014 (PE Pipeline Execution Model) D6 (tile plan / self-routing) does not
specify the tile-plan generation algorithm itself; this ADR fills that gap.
## First action
When `generate_gemm_plan(M, K, N, tile_m, tile_k, tile_n, ..., pe_prefix,
a_pinned, b_pinned, epilogue_specs)` is called, the very first action is
**computing tile counts and constructing the PE-component ID strings**:
```
M_tiles = max(1, ceil(M / tile_m))
K_tiles = max(1, ceil(K / tile_k))
N_tiles = max(1, ceil(N / tile_n))
dma_id = f"{pe_prefix}.pe_dma"
fetch_id = f"{pe_prefix}.pe_fetch_store"
gemm_id = f"{pe_prefix}.pe_gemm"
math_id = f"{pe_prefix}.pe_math"
```
In short, **the plan generator's first act is "compute ceiling tile counts
and assemble the four sub-component IDs for this PE once"**. No SimPy event
or environment is touched — this module is a pure function.
`generate_math_plan(M, N, tile_m, tile_n, ..., math_op, src_addr, dst_addr,
pe_prefix)` likewise begins by computing `M_tiles`, `N_tiles` and assembling
three component IDs (`dma_id`, `fetch_id`, `math_id`).
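As a worked illustration of this first step, the minimal sketch below reproduces the ceiling arithmetic and the ID assembly in isolation. The helper names `tile_counts` and `pe_component_ids` and the prefix value `"pe0"` are hypothetical, chosen only for illustration; they are not the signatures used in `tiling.py`.

```python
from math import ceil


def tile_counts(M, K, N, tile_m, tile_k, tile_n):
    """Ceiling tile counts, clamped so every dimension has at least one tile."""
    return (
        max(1, ceil(M / tile_m)),
        max(1, ceil(K / tile_k)),
        max(1, ceil(N / tile_n)),
    )


def pe_component_ids(pe_prefix):
    """Assemble the four PE sub-component IDs once per plan (illustrative helper)."""
    names = ("pe_dma", "pe_fetch_store", "pe_gemm", "pe_math")
    return {name: f"{pe_prefix}.{name}" for name in names}


# A 100 x 64 x 33 problem with 32-wide tiles: partial tiles round up.
counts = tile_counts(100, 64, 33, 32, 32, 32)  # (4, 2, 2)
ids = pe_component_ids("pe0")                  # "pe0" is a made-up prefix
```

The `max(1, ...)` clamp means a zero-sized dimension still yields one tile, which is why the sketch keeps it separate from the `ceil` call.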
## Context
ADR-0014 D6 agreed that "PE_SCHEDULER, on receiving a CompositeCmd, generates
a TilePlan and feeds self-routing tile tokens". But the **concrete plan
generation algorithm** lives in `src/kernbench/components/builtin/tiling.py`,
which:
- Defines no component — it is a pair of **pure functions**
(`generate_gemm_plan`, `generate_math_plan`).
- Does not depend on the SimPy environment, queues, op_log, or hooks.
- Returns a `PipelinePlan` (dataclass).
The original G4 analysis incorrectly described `tiling.py` as a component;
it is in fact a plan-builder helper consumed by PE_SCHEDULER. Pinning this
down in its own ADR (paired with ADR-0014 D6) prevents:
- Ambiguity over whether plan generation belongs to PE_SCHEDULER or a
separate module.
- Inconsistent rationale for stage sequences (e.g., FETCH/STORE position)
between GEMM and Math plans.
- Undocumented branching rationale for `a_pinned` / `b_pinned` /
`epilogue_specs`.
## Decision
### D1. `tiling` is a pure plan-generator module, not a component
`components/builtin/tiling.py` defines no `ComponentBase` subclass. It exports
two module-level functions:
- `generate_gemm_plan(...) -> PipelinePlan`
- `generate_math_plan(...) -> PipelinePlan`
There is no `tiling` node in the topology graph. It lives in `builtin/`
because it is a direct helper for PE_SCHEDULER (ADR-0014 D6) and is
conceptually a PE_SCHEDULER internal utility.
### D2. GEMM plan stage sequence — `M → N → K` order
Stage sequence for each `(m, n, k)` tile. Brackets mark conditional stages: by
default (no operand pinning, no epilogue) both DMA_READs are emitted and no
MATH stage is.
```
[DMA_READ(A)] → [DMA_READ(B)] → FETCH → GEMM
(last k tile only) [MATH(output_tile)]* → STORE → DMA_WRITE
```
`k_tile` epilogue inserts a MATH stage immediately after GEMM on every
K-tile; `output_tile` epilogue inserts MATH stages once per `(m, n)` after
the final K-tile but before STORE/DMA_WRITE. The K-loop accumulator stays
in the register file across K-tiles — STORE/DMA_WRITE happens only when
`last_k`.
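The loop nesting and the `last_k` rule can be sketched as follows for the default case only (no pinning, no epilogue). This is an illustrative reconstruction, not the code in `tiling.py`: the function name `gemm_tile_stages` is hypothetical, and stages are represented as plain strings rather than `Stage` objects.

```python
def gemm_tile_stages(M_tiles, N_tiles, K_tiles):
    """Return [((m, n, k), [stage names])] in M -> N -> K order (default case)."""
    plan = []
    for m in range(M_tiles):
        for n in range(N_tiles):
            for k in range(K_tiles):
                stages = ["DMA_READ(A)", "DMA_READ(B)", "FETCH", "GEMM"]
                # The accumulator stays resident across K-tiles, so the
                # write-back stages appear only on the final K-tile.
                if k == K_tiles - 1:
                    stages += ["STORE", "DMA_WRITE"]
                plan.append(((m, n, k), stages))
    return plan


demo = gemm_tile_stages(1, 2, 2)
```

With `K_tiles = 2`, every second entry of `demo` ends in `STORE`, `DMA_WRITE`, and the `n` index advances only after both K-tiles of the previous `(m, n)` have been listed.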
### D3. Operand pinning — `a_pinned` / `b_pinned`
If a caller passes `a_pinned=True`, **the A DMA_READ is omitted from every
(m, n, k) tile**; `b_pinned=True` does the same for the B DMA_READ.
Semantically, the caller (e.g., `tl.composite`) has already staged the whole
operand in TCM via a prior `tl.load` and signals this to the plan generator.
The branch is made at plan time (not at runtime). Therefore the stage record
count in op_log changes deterministically with pinning, and sweep analyses
(e.g., gemm_sweep's stage record count) see this decision directly.
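To make the effect on stage counts concrete, here is a small arithmetic sketch. It rests on three assumptions that this ADR does not state outright: `b_pinned` is symmetric to `a_pinned`, no epilogue is configured, and each stage contributes exactly one op_log record. The function name `gemm_stage_count` is hypothetical.

```python
def gemm_stage_count(M_tiles, N_tiles, K_tiles, a_pinned=False, b_pinned=False):
    """Total stages in a GEMM plan without epilogue (illustrative formula)."""
    reads = (0 if a_pinned else 1) + (0 if b_pinned else 1)
    per_k_tile = reads + 2      # optional DMA_READs + FETCH + GEMM
    per_output_tile = 2         # STORE + DMA_WRITE, last K-tile only
    k_tiles_total = M_tiles * N_tiles * K_tiles
    output_tiles_total = M_tiles * N_tiles
    return k_tiles_total * per_k_tile + output_tiles_total * per_output_tile


baseline = gemm_stage_count(2, 2, 4)                                  # 72
a_only = gemm_stage_count(2, 2, 4, a_pinned=True)                     # 56
both = gemm_stage_count(2, 2, 4, a_pinned=True, b_pinned=True)        # 40
```

Under these assumptions each pinned operand removes exactly `M_tiles * N_tiles * K_tiles` stages, which is the kind of step a sweep would observe.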
### D4. Epilogue scope — `k_tile` vs `output_tile`
`epilogue_specs` is an iterable of op-spec objects. Each op object is
expected to have:
- `op.kind: str` — math op name (e.g., `"dequant"`, `"bias"`, `"relu"`,
`"scale"`). Placed into the stage's `params["op_kind"]`.
- `op.scope: Scope`, either `Scope.K_TILE` or `Scope.OUTPUT_TILE` (the `Scope`
  enum lives in `kernbench.common.pe_commands`).
- Op-specific extras (e.g., `bias`, `scale`, `factor`) — currently not used
by the plan generator; consumed at runtime by PE_MATH.
The plan generator partitions by `getattr(o, "scope", None)`:
- `scope == Scope.K_TILE`: adds a MATH stage right after GEMM on every K-tile.
- `scope == Scope.OUTPUT_TILE`: adds MATH stages just before STORE on the
last K-tile per `(m, n)`.
Ops with neither `scope` value (e.g., missing attribute) are **dropped
silently** — `getattr(..., None) == Scope.X` is False for both. Picking a
default (`output_tile`) is the **caller's responsibility** (e.g.,
`tl.composite`), not the plan generator's. This aligns with ADR-0014's
composite epilogue contract.
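The partitioning rule, including the silent drop, can be sketched as below. The `Scope` enum here is a local stand-in for the one in `kernbench.common.pe_commands` (its member values are assumed), and `partition_epilogue` is a hypothetical helper name; only the `getattr(o, "scope", None)` comparison is taken from the text above.

```python
from enum import Enum
from types import SimpleNamespace


class Scope(Enum):
    """Stand-in for the real Scope enum; member values are assumptions."""
    K_TILE = "k_tile"
    OUTPUT_TILE = "output_tile"


def partition_epilogue(epilogue_specs):
    """Split op specs by scope; ops with no recognized scope match neither list."""
    specs = list(epilogue_specs or ())
    k_tile_ops = [o for o in specs if getattr(o, "scope", None) == Scope.K_TILE]
    output_tile_ops = [
        o for o in specs if getattr(o, "scope", None) == Scope.OUTPUT_TILE
    ]
    return k_tile_ops, output_tile_ops


ops = [
    SimpleNamespace(kind="dequant", scope=Scope.K_TILE),
    SimpleNamespace(kind="bias", scope=Scope.OUTPUT_TILE),
    SimpleNamespace(kind="relu"),  # no scope attribute: dropped silently
]
k_ops, out_ops = partition_epilogue(ops)
```

Because the `relu` op appears in neither list, a caller that wants it applied has to set `scope` explicitly before calling the generator.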
`Scope` is imported lazily inside the function to avoid the circular path
`pe_commands ← pe_types ← tiling`. This is intentional and not a refactor
target: keeping `tiling` free of import-time `pe_commands` dependencies
preserves the module boundary (D1).
### D5. Math plan stage sequence — `M → N` order
For each `(m, n)` tile:
```
DMA_READ → FETCH → MATH → STORE → DMA_WRITE
```
There is no K dimension, so concepts like epilogue or accumulator residency
do not apply. PE_FETCH_STORE's register-file accounting follows the same
pattern as the GEMM plan.
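For comparison with the GEMM plan, a minimal sketch of the Math plan's shape follows. The name `math_tile_stages` is hypothetical and stages are plain strings; the real generator returns a `PipelinePlan`.

```python
MATH_TILE_STAGES = ("DMA_READ", "FETCH", "MATH", "STORE", "DMA_WRITE")


def math_tile_stages(M_tiles, N_tiles):
    """Return [((m, n), stage names)] in M -> N order; every tile is identical."""
    return [
        ((m, n), MATH_TILE_STAGES)
        for m in range(M_tiles)
        for n in range(N_tiles)
    ]


math_demo = math_tile_stages(2, 3)
```

Since there is no K loop, every tile carries the full five-stage sequence and the total stage count is simply `5 * M_tiles * N_tiles`.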
### D6. Plans are data — no SimPy dependency
`PipelinePlan` is a dataclass in `pe_types.py` holding `tiles:
list[TilePlan]`. Each `TilePlan` holds `stages: tuple[Stage, ...]`. The plan
itself is near-immutable (only `Stage.params: dict` is mutable) and holds no
SimPy objects.
At runtime, PE_SCHEDULER consumes the plan's first stage, builds a `TileToken`,
and feeds it into the pipeline. The TileToken carries `plan: TilePlan`,
`stage_idx: int`, and a cached `params: dict`. Self-routing proceeds by
`TileToken.advance()` caching the next stage's `params` (ADR-0014 D6).
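The data shapes and the self-routing step described above might look roughly like this. Only the fields named in this ADR (`tiles`, `stages`, `params`, `plan`, `stage_idx`) come from the source; the `component_id` field on `Stage`, the boolean return value of `advance()`, and the `"pe0"` IDs are assumptions made for the sketch, and the real definitions in `pe_types.py` may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    component_id: str                           # hypothetical field
    params: dict = field(default_factory=dict)  # the only mutable part


@dataclass
class TilePlan:
    stages: tuple[Stage, ...]


@dataclass
class PipelinePlan:
    tiles: list[TilePlan]


@dataclass
class TileToken:
    plan: TilePlan
    stage_idx: int = 0
    params: dict = field(default_factory=dict)

    def advance(self) -> bool:
        """Step to the next stage and cache its params.

        Returns False when the plan is exhausted (assumed convention).
        """
        nxt = self.stage_idx + 1
        if nxt >= len(self.plan.stages):
            return False
        self.stage_idx = nxt
        self.params = self.plan.stages[nxt].params
        return True


tile = TilePlan(stages=(
    Stage("pe0.pe_dma", {"op": "read"}),
    Stage("pe0.pe_fetch_store"),
    Stage("pe0.pe_math", {"op_kind": "relu"}),
))
plan = PipelinePlan(tiles=[tile])
token = TileToken(plan=tile, stage_idx=0, params=tile.stages[0].params)
```

Nothing in these classes refers to a SimPy environment, so a plan can be built, inspected, and compared in a plain unit test.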
### D7. Plan generator contract — pure, deterministic, idempotent
Two calls with identical inputs return plans with identical contents
(including `TilePlan.stages` order). This contract aligns with ADR-0014 D6's
"deterministic tile dispatch order".
No side effects (no SimPy events, no file I/O, no global state) — tests can
call the generators directly without an environment object (some cases in
`tests/test_pe_pipeline.py` rely on this).
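An environment-free check of this contract can be as small as the sketch below. `ToyPlan` and `toy_generate_plan` are hypothetical stand-ins for `PipelinePlan` and the real generators; the check relies on the default dataclass `__eq__`, which compares field by field, and so carries over only if the real plan dataclasses keep that default.

```python
from dataclasses import dataclass


@dataclass
class ToyPlan:
    tiles: list


def toy_generate_plan(M_tiles, N_tiles):
    """Stand-in generator: a pure function of its arguments."""
    return ToyPlan(tiles=[(m, n) for m in range(M_tiles) for n in range(N_tiles)])


def is_deterministic(generator, *args, **kwargs):
    """Call the generator twice with the same inputs and compare the results."""
    return generator(*args, **kwargs) == generator(*args, **kwargs)


ok = is_deterministic(toy_generate_plan, 2, 3)
```

No SimPy environment, fixture, or teardown is involved, which is the practical payoff of keeping the generators pure.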
## Alternatives Considered
### A1. Make tiling a component (e.g., PE_PLANNER)
Rejected. Plan generation consumes no SimPy time — it is a pure decision
algorithm. Making it a component would (a) add unnecessary infrastructure
(inbox, resources), and (b) split PE_SCHEDULER's flow into "receive plan"
plus "feed tiles", inserting a meaningless hop.
### A2. Move plan generation into PE_SCHEDULER as methods
Rejected for now. Keeping a separate module provides (1) testability and
(2) extensibility: an additional plan algorithm (e.g., DTensor-aware) is just
a new function. If plan kinds proliferate enough to require explicit
dispatch, a future ADR can introduce a plan factory on PE_SCHEDULER.
### A3. Make plans fully immutable (frozen dataclass + tuple)
Partially adopted. `Stage` and `TilePlan` are dataclasses but not frozen,
because `Stage.params: dict` is populated at plan-generation time and read
at runtime (cached by TileToken on advance). Migrating `dict` to a frozendict
would incur migration cost without enough benefit. Convention: do not mutate
a plan after generation.
## Consequences
- `tiling.py` is documented as a plan-generator module, not a component —
preempting future G4-style "this component lacks an ADR" analyses.
- The GEMM plan's stage sequence (D2) and pinning / epilogue branching
(D3 / D4) are pinned, providing a clear interpretation basis for sweep
analyses (e.g., `scripts/gemm_sweep.py`'s stage record counts).
- The plan generator's pure contract (D7) enables environment-free testing
in line with ADR-0013 (verification strategy).
- Future plan kinds (DTensor-aware, K-major, ...) follow D1 / D6 / D7 as a
baseline — just add a new function.