ADR: add 0038-0042 (pcie_ep, pe_mmu, pe_tcm, sram, tiling)

Fill component-model coverage gaps surfaced by /report's G4 analysis. Each ADR
documents the component's First action, latency model, and honest notes on
dormant code or implementation asymmetries discovered during re-evaluation
against current code.

- 0038 pcie_ep: thin protocol-overhead model; ComponentBase forwarding worker
  as-is; named-node contract for router helpers
- 0039 pe_mmu: component + utility dual role; sub-page region stopgap; D2.1
  flags pipeline path missing mmu.overhead_ns timeout (asymmetric with
  non-pipeline; not visible at default tlb_overhead_ns=0)
- 0040 pe_tcm: dual-channel BW serialization (read/write Resource cap=1);
  TcmRequest schema owned by TCM; timing-only (no data store)
- 0041 sram: terminal scratchpad model + ResponseMsg on reverse path; D1.1
  flags _worker override as currently dormant (no Transaction actually targets
  the SRAM node today)
- 0042 tiling: pure plan-generator module, not a component; corrects the G4
  misclassification; pins GEMM/Math stage sequences and epilogue scope contract

Also: /report skill G3 refinement (only flag older->newer asymmetric
cross-references; newer->older, e.g., 0034-0037 citing infrastructure ADRs,
are expected one-way and no longer reported).

Bilingual pair verifier (tools/verify_adr_lang_pairs.py) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:43:03 -07:00
parent 049e3d8bb3
commit 1f36baa898
11 changed files with 1747 additions and 3 deletions
@@ -0,0 +1,149 @@
+# ADR-0040: PE_TCM Component Model — Dual-Channel BW Serialization
+
+## Status
+
+Accepted (2026-05-20).
+
+ADR-0014 (PE Pipeline Execution Model, D1) references PE_TCM as a "BW-based
+serialized scratchpad memory" but does not pin down the component's own model.
+This ADR fills that gap.
+
+## First action
+
+When `start()` is invoked, immediately create two `simpy.Resource(env, capacity=1)`
+instances and store them in `self._read_res` / `self._write_res`. Each resource
+is the sole serialization point for its channel: the **read channel** and the
+**write channel** each admit one in-flight request at a time.
+
+At runtime, the first action is that `_worker` pulls a message off `_inbox` and
+branches by type:
+
+- `TcmRequest` (from `pe_fetch_store`): spawn `env.process(self._handle_tcm_request)`.
+  Hence **TCM's first act is "acquire the lock matching the direction
+  (read/write)"**. After lock acquisition, if `bw > 0 and nbytes > 0`, yield
+  `env.timeout(delay_ns)` with `delay_ns = nbytes / bw`, then call
+  `req.done.succeed()`.
+- Anything else (Transaction): spawn `env.process(self._forward_txn)` (legacy
+  fabric pass-through).
+
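The type-based branch above can be sketched in a few lines. This is a minimal illustration, not the component's code: `TcmRequest` and `Transaction` here are simplified stand-ins, and `handler_name` is a hypothetical helper that only reports which handler `_worker` would spawn.

```python
from dataclasses import dataclass


@dataclass
class TcmRequest:
    """Simplified stand-in; the full schema is given in D4."""
    direction: str
    nbytes: int


@dataclass
class Transaction:
    """Hypothetical stand-in for a fabric Transaction."""
    payload: str = ""


def handler_name(msg) -> str:
    """Return the name of the handler that _worker would spawn for msg."""
    if isinstance(msg, TcmRequest):
        return "_handle_tcm_request"   # BW-serialized sideband path
    return "_forward_txn"              # anything else: legacy pass-through


print(handler_name(TcmRequest("read", 64)))   # _handle_tcm_request
print(handler_name(Transaction()))            # _forward_txn
```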
+At construction, `node.attrs["read_bw_gbs"]` and `node.attrs["write_bw_gbs"]`
+(default `512.0 GB/s` each) are captured and held.
+
+## Context
+
+In the PE pipeline (ADR-0014 D1, D6), PE_TCM receives two kinds of traffic:
+
+1. **`TcmRequest` from PE_FETCH_STORE** — when moving data between TCM and
+   the register file, PE_FETCH_STORE sends a short sideband request to obtain
+   BW-serialized access latency (`direction = "read"` or `"write"`, `nbytes`,
+   `done` event).
+2. **Legacy Transaction forwarding** — a fallback in case TCM ends up as a
+   pass-through node on the fabric graph (not used by the current critical
+   path, but preserved).
+
+The problem: ADR-0014 only says "BW-based serialization" without specifying the following:
+
+- Read and write are **independent channels** running in parallel; only
+  same-direction concurrency serializes at `capacity=1`.
+- BW is split into two configurable values (`read_bw_gbs` / `write_bw_gbs`).
+- The formula is `delay_ns = nbytes / bw_gbs`. Under the loose unit convention
+  GB/s × ns ≈ B (1 GB/s = 10^9 B/s = 1 B/ns), the quotient is in nanoseconds.
+- `nbytes == 0` still acquires the lock but skips the BW term.
+- `run()`'s `overhead_ns` (default `0.0`) is only used in the legacy fabric
+  forwarding path.
+
+Each of these points needs to be pinned down in an ADR. In particular, "why
+are read and write separate channels" and "who owns the BW values" must be
+documented so that future changes (e.g., `capacity=2`) have a clear basis.
+
+## Decision
+
+### D1. Dual channel — read and write are independent resources
+
+`_read_res = simpy.Resource(env, capacity=1)`,
+`_write_res = simpy.Resource(env, capacity=1)`.
+Same-direction concurrent requests queue on the resource and serialize;
+opposite-direction requests proceed in parallel. This matches the hardware
+model, in which TCM has a dual-port (read + write) configuration. It also lets
+the simulator express the GEMM-pipeline case where fetch (read) and store
+(write) overlap in time: requests are BW-serialized within each direction but
+independent across directions.
+
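The queueing behavior in D1 can be illustrated without SimPy. The sketch below is a hypothetical pure-Python stand-in (`Req` and `completion_times` are illustrative names, not from the codebase). It treats each direction as a FIFO channel that serves one request at a time, assumes requests are listed in arrival order, and reuses the D2 delay formula.

```python
from dataclasses import dataclass


@dataclass
class Req:
    direction: str            # "read" or "write"
    nbytes: int
    arrival_ns: float = 0.0


def completion_times(reqs, read_bw_gbs=512.0, write_bw_gbs=512.0):
    """Finish time (ns) of each request, in input order.

    Assumes reqs is sorted by arrival time. Each direction behaves like a
    capacity-1 resource: a request starts when it has arrived AND the
    previous same-direction request has finished.
    """
    bw = {"read": read_bw_gbs, "write": write_bw_gbs}
    free_at = {"read": 0.0, "write": 0.0}   # when each channel is next idle
    finish = []
    for r in reqs:
        start = max(r.arrival_ns, free_at[r.direction])
        b = bw[r.direction]
        delay = r.nbytes / b if (r.nbytes > 0 and b > 0) else 0.0
        end = start + delay
        free_at[r.direction] = end          # channel is busy until `end`
        finish.append(end)
    return finish


reqs = [Req("read", 4096), Req("read", 4096), Req("write", 2048)]
print(completion_times(reqs))  # [8.0, 16.0, 4.0]
```

With the default 512 GB/s on both channels, the second read finishes at 16 ns because it queues behind the first, while the write finishes at 4 ns because it never waits on the read channel.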
+### D2. Per-channel BW model — `nbytes / bw_gbs`
+
+After lock acquisition, if `nbytes > 0 and bw > 0`, yield
+`env.timeout(nbytes / bw_gbs)`. The unit convention is GB/s × ns ≈ B,
+consistent with the simulator-wide loose convention (see ADR-0033).
+
+- `nbytes == 0`: BW term is zero, but the lock is acquired and released. This
+  is intentional: when a plan generator emits an empty fetch/store on the
+  PE_FETCH_STORE side, the op_log / channel accounting on the TCM side still
+  records one consumption.
+- `bw <= 0` (config error): the `bw > 0` guard fails, so the timeout is
+  skipped and the request completes in zero simulated time once it holds the
+  lock. Should not occur with normal settings.
+
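As a worked example of the D2 formula and its two guards, here is a minimal sketch. The function name `tcm_delay_ns` is hypothetical; only the guard condition and the `nbytes / bw_gbs` expression come from this ADR.

```python
def tcm_delay_ns(nbytes: int, bw_gbs: float) -> float:
    """BW term in ns; 0.0 whenever the guard (nbytes > 0 and bw_gbs > 0) fails."""
    if nbytes > 0 and bw_gbs > 0:
        # 1 GB/s = 1 B/ns (decimal GB), so bytes / (GB/s) is in nanoseconds.
        return nbytes / bw_gbs
    return 0.0


print(tcm_delay_ns(4096, 512.0))  # 8.0
print(tcm_delay_ns(0, 512.0))     # 0.0 (empty fetch/store: lock only)
print(tcm_delay_ns(4096, 0.0))    # 0.0 (config error: timeout skipped)
```

The guard also keeps a zero bandwidth from raising `ZeroDivisionError`; in the component, the lock is still acquired and released in all three cases.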
+### D3. BW values come from `node.attrs["read_bw_gbs"]` / `node.attrs["write_bw_gbs"]`
+
+Both default to `512.0` GB/s. The topology builder (`topology/builder.py`) passes
+these attrs when instantiating TCM from `pe_template`. Changes to the defaults
+should be made together with the related decisions in ADR-0014 D1 or ADR-0033.
+
+### D4. TcmRequest schema is owned by PE_TCM
+
+`@dataclass TcmRequest(direction: str, nbytes: int, done: simpy.Event, tag: str = "")`
+lives in `components/builtin/pe_tcm.py`. PE_FETCH_STORE imports the dataclass
+and only constructs/sends it. The caller does not define the schema because:
+
+- The meaning of BW serialization is TCM's responsibility — TCM decides which
+  fields drive serialization.
+- The valid-value check for `direction` (must be `"read"` or `"write"`) lives
+  in `_handle_tcm_request`'s if/else branch.
+
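The schema in D4 could be written out roughly as follows. This is a sketch under stated assumptions: `done` is typed as `Any` to keep the example dependency-free (the ADR specifies `simpy.Event`), `channel_for` is a hypothetical helper, and raising `ValueError` on an unknown direction is an assumption, since the ADR only says the check lives in an if/else branch.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TcmRequest:
    direction: str   # "read" or "write"
    nbytes: int
    done: Any        # a simpy.Event in the ADR; loosely typed here
    tag: str = ""


def channel_for(req: TcmRequest) -> str:
    """Name of the resource that serializes this request (hypothetical helper)."""
    if req.direction == "read":
        return "_read_res"
    if req.direction == "write":
        return "_write_res"
    # Assumption: the real component may handle bad values differently.
    raise ValueError(f"unknown direction: {req.direction!r}")


req = TcmRequest(direction="write", nbytes=2048, done=None, tag="store0")
print(channel_for(req))  # _write_res
```

Keeping the dataclass next to the code that interprets `direction` means a caller such as PE_FETCH_STORE only constructs and sends the request, and any new field that affects serialization is added in one place.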
+### D5. Legacy Transaction forwarding path is preserved
+
+When `_worker` receives a non-`TcmRequest` message, it dispatches to
+`_forward_txn`, applying `run()`'s `overhead_ns`. The current standard PE
+pipeline does not route Transactions through TCM, but the path is kept to
+avoid breakage if fabric topology changes.
+
+This path is accounted for via the standard Transaction op_log; the BW channel
+locks are **not** acquired (the path is orthogonal to D1's resources).
+
+### D6. PE_TCM is not a data store (timing only)
+
+TCM models **time only**. The actual data payload is held by sim_engine's
+`memory_store` (when present); the TCM component never updates it.
+PE_FETCH_STORE obtains BW delay through `TcmRequest`, and register contents
+are handled separately in the data path (ADR-0020 2-pass data execution —
+Phase 2).
+
+## Alternatives Considered
+
+### A1. Single channel (`capacity=2` for shared read+write)
+
+Rejected. Would artificially serialize the normal-case overlap of fetch
+(read) and store (write) and yield an incorrect BW upper bound for the PE
+pipeline.
+
+### A2. `capacity > 1` (e.g., 2-banked TCM)
+
+Rejected. Current hardware model assumes a single bank. Multi-bank extension
+needs its own ADR that would supersede D1. Bumping capacity now would loosen
+the nominal serialization without raising the BW upper bound, producing less
+accurate modeling.
+
+### A3. Generalize BW formula to `nbytes / bw + overhead_ns`
+
+Rejected. `overhead_ns` is reserved for the legacy forwarding path (D5).
+Additional fetch/store-path overhead, if needed, belongs in PE_FETCH_STORE's
+`run()` or in a register-file access model — closer to the responsibility
+boundary.
+
+## Consequences
+
+- TCM's BW accounting is locked at ADR level. Questions arising from op_log
+  in GEMM/Math sweeps — "why did fetch and store overlap?", "why do only
+  same-direction requests serialize?" — resolve quickly to D1.
+- Future multi-bank TCM models or asymmetric read/write BW changes have a
+  clear blast radius (D1 / D2 / D3 — pick one).
+- D6 ("TCM is not a data store") sharpens the responsibility boundary with
+  ADR-0020 2-pass execution.