Fill component-model coverage gaps surfaced by /report's G4 analysis.

Each ADR documents the component's First action, latency model, and honest notes on dormant code or implementation asymmetries discovered during re-evaluation against current code.

- 0038 pcie_ep: thin protocol-overhead model; ComponentBase forwarding worker as-is; named-node contract for router helpers
- 0039 pe_mmu: component + utility dual role; sub-page region stopgap; D2.1 flags pipeline path missing mmu.overhead_ns timeout (asymmetric with non-pipeline; not visible at default tlb_overhead_ns=0)
- 0040 pe_tcm: dual-channel BW serialization (read/write Resource cap=1); TcmRequest schema owned by TCM; timing-only (no data store)
- 0041 sram: terminal scratchpad model + ResponseMsg on reverse path; D1.1 flags _worker override as currently dormant (no Transaction actually targets the SRAM node today)
- 0042 tiling: pure plan-generator module, not a component; corrects the G4 misclassification; pins GEMM/Math stage sequences and epilogue scope contract

Also: /report skill G3 refinement: only flag older->newer asymmetric cross-references; newer->older (e.g., 0034-0037 citing infrastructure ADRs) are expected one-way and no longer reported.

Bilingual pair verifier (tools/verify_adr_lang_pairs.py) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ADR-0040: PE_TCM Component Model — Dual-Channel BW Serialization
Status
Accepted (2026-05-20).
ADR-0014 (PE Pipeline Execution Model, D1) references PE_TCM as a "BW-based serialized scratchpad memory" but does not pin down the component's own model. This ADR fills that gap.
First action
When start() is invoked, immediately create two simpy.Resource(env, capacity=1)
instances and store them in self._read_res / self._write_res. These two
resources are the single decision points that serialize the read channel
and write channel to one in-flight request each.
The runtime first action: _worker pulls a message off _inbox and branches
by type:
- TcmRequest (from pe_fetch_store): spawn env.process(self._handle_tcm_request). Hence TCM's first act is "acquire the lock matching the direction (read/write)". After lock acquisition, if bw > 0 and nbytes > 0, yield env.timeout(delay_ns = nbytes / bw), then call req.done.succeed().
- Anything else (Transaction): spawn env.process(self._forward_txn) (legacy fabric pass-through).
At construction, node.attrs["read_bw_gbs"] and node.attrs["write_bw_gbs"]
(default 512.0 GB/s each) are captured and held.
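The type-based branch in _worker can be sketched without the simulator. This is a minimal illustration, not the component's code: the two classes are empty hypothetical placeholders, and route() is an invented helper that only returns the name of the handler the ADR says would be spawned.

```python
from typing import Any


class TcmRequest:
    """Hypothetical placeholder for the sideband request type."""


class Transaction:
    """Hypothetical placeholder for the fabric Transaction type."""


def route(msg: Any) -> str:
    """Return the name of the handler that would be spawned for msg."""
    if isinstance(msg, TcmRequest):
        return "_handle_tcm_request"
    # Anything that is not a TcmRequest takes the legacy forwarding path.
    return "_forward_txn"


print(route(TcmRequest()))   # _handle_tcm_request
print(route(Transaction()))  # _forward_txn
```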
Context
In the PE pipeline (ADR-0014 D1, D6), PE_TCM receives two kinds of traffic:
- TcmRequest from PE_FETCH_STORE: when moving data between TCM and the register file, PE_FETCH_STORE sends a short sideband request to obtain BW-serialized access latency (direction = "read" or "write", nbytes, done event).
- Legacy Transaction forwarding: a fallback in case TCM ends up as a pass-through node on the fabric graph (not used by the current critical path, but preserved).
The problem: ADR-0014 only says "BW-based serialization" without specifying:
- Read and write are independent channels running in parallel; only same-direction concurrency serializes at capacity=1.
- BW is split into two configurable values (read_bw_gbs / write_bw_gbs).
- The formula is delay_ns = nbytes / bw_gbs (loose unit convention: GB/s × ns ≈ B).
- nbytes == 0 still acquires the lock but skips the BW term.
- run()'s overhead_ns (default 0.0) is only used in the legacy fabric forwarding path.
Each of these points needs to be recorded in an ADR. In particular, "why are
read and write separate channels" and "who owns the BW values" must be
documented so that future changes (e.g., capacity=2) have a clear basis.
Decision
D1. Dual channel — read and write are independent resources
_read_res = simpy.Resource(env, capacity=1),
_write_res = simpy.Resource(env, capacity=1).
Same-direction concurrent requests queue on the resource and serialize;
opposite-direction requests proceed in parallel. This matches the hardware
model where TCM has a dual-port (read + write) configuration, and it allows
the simulator to express the GEMM-pipeline case where fetch (read) and store
(write) overlap in time — modeled as BW-serialized inside each direction but
independent across directions.
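The effect of D1 can be checked with a small analytic model. The sketch below does not use simpy: it assumes each channel behaves as a FIFO queue with one in-flight request, which is what a capacity=1 resource provides when requests are served in arrival order. The function name channel_timeline and the request tuples are illustrative, not part of the codebase.

```python
from typing import Dict, List, Tuple

# (arrival_ns, direction, nbytes); all values below are illustrative.
Request = Tuple[float, str, int]


def channel_timeline(requests: List[Request],
                     read_bw_gbs: float = 512.0,
                     write_bw_gbs: float = 512.0) -> List[Tuple[float, float]]:
    """Return (start_ns, finish_ns) per request, in arrival order."""
    bw = {"read": read_bw_gbs, "write": write_bw_gbs}
    free_at: Dict[str, float] = {"read": 0.0, "write": 0.0}
    spans = []
    # sorted() is stable, so requests with equal arrival keep their input order.
    for arrival, direction, nbytes in sorted(requests, key=lambda r: r[0]):
        start = max(arrival, free_at[direction])  # wait for this channel's lock
        rate = bw[direction]
        delay = nbytes / rate if nbytes > 0 and rate > 0 else 0.0
        finish = start + delay
        free_at[direction] = finish               # lock is released at finish
        spans.append((start, finish))
    return spans


spans = channel_timeline([
    (0.0, "read", 4096),
    (0.0, "read", 4096),
    (0.0, "write", 2048),
])
print(spans)  # [(0.0, 8.0), (8.0, 16.0), (0.0, 4.0)]
```

The second read waits for the first (8 ns to 16 ns), while the write starts at 0 ns alongside the first read.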
D2. Per-channel BW model — nbytes / bw_gbs
After lock acquisition, if nbytes > 0 and bw > 0, yield
env.timeout(nbytes / bw_gbs). The unit convention is GB/s × ns ≈ B,
consistent with the simulator-wide loose convention (see ADR-0033).
- nbytes == 0: BW term is zero, but the lock is acquired and released. This is intentional: when a plan generator emits an empty fetch/store on the PE_FETCH_STORE side, the op_log / channel accounting on the TCM side still records one consumption.
- bw == 0 (config error): the timeout call is skipped (0-time pass). Should not occur with normal settings.
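As a worked check of the formula: if GB is read as 10^9 bytes, then 1 GB/s is exactly 1 byte per ns, so nbytes / bw_gbs comes out directly in nanoseconds. The helper below is a sketch of the BW term only; tcm_delay_ns is a hypothetical name and lock handling is left out.

```python
def tcm_delay_ns(nbytes: int, bw_gbs: float) -> float:
    """BW term of D2: nbytes / bw_gbs, treating 1 GB/s as 1 byte per ns."""
    if nbytes > 0 and bw_gbs > 0:
        return nbytes / bw_gbs
    # nbytes == 0 or bw == 0: no timeout is issued (zero-time pass).
    return 0.0


print(tcm_delay_ns(4096, 512.0))  # 8.0
print(tcm_delay_ns(0, 512.0))     # 0.0
print(tcm_delay_ns(4096, 0.0))    # 0.0
```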
D3. BW values come from node.attrs.read_bw_gbs / write_bw_gbs
Defaults 512.0 GB/s. The topology builder (topology/builder.py) passes
these attrs when instantiating TCM from pe_template. Default changes should
coincide with related decisions in ADR-0014 D1 or ADR-0033.
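One way to capture the two values with their defaults is sketched here. The attrs mapping stands in for node.attrs, and capture_bw and DEFAULT_BW_GBS are hypothetical names; only the keys and the 512.0 default come from this ADR.

```python
from typing import Mapping, Tuple

DEFAULT_BW_GBS = 512.0  # default stated in D3


def capture_bw(attrs: Mapping[str, float]) -> Tuple[float, float]:
    """Return (read_bw_gbs, write_bw_gbs), falling back to the default."""
    read_bw = float(attrs.get("read_bw_gbs", DEFAULT_BW_GBS))
    write_bw = float(attrs.get("write_bw_gbs", DEFAULT_BW_GBS))
    return read_bw, write_bw


print(capture_bw({}))                       # (512.0, 512.0)
print(capture_bw({"write_bw_gbs": 256.0}))  # (512.0, 256.0)
```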
D4. TcmRequest schema is owned by PE_TCM
@dataclass TcmRequest(direction: str, nbytes: int, done: simpy.Event, tag: str = "")
lives in components/builtin/pe_tcm.py. PE_FETCH_STORE imports the dataclass
and only constructs/sends it. The caller does not define the schema because:
- The meaning of BW serialization is TCM's responsibility — TCM decides which fields drive serialization.
- The valid-value check for direction (must be "read" or "write") lives in _handle_tcm_request's if/else branch.
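The schema and the direction branch might look roughly like the sketch below. It makes two assumptions: done is typed as Any so the example runs without simpy (the ADR types it as simpy.Event), and an unknown direction raises ValueError, which this ADR does not confirm. select_channel is a hypothetical helper, not a function in pe_tcm.py.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TcmRequest:
    direction: str  # "read" or "write"
    nbytes: int
    done: Any       # simpy.Event in the real component
    tag: str = ""


def select_channel(req: TcmRequest) -> str:
    """Map a request to the name of the lock it must acquire."""
    if req.direction == "read":
        return "_read_res"
    elif req.direction == "write":
        return "_write_res"
    # Assumption: unknown directions are rejected; the real behavior may differ.
    raise ValueError(f"unknown TcmRequest direction: {req.direction!r}")


print(select_channel(TcmRequest("read", 64, None)))   # _read_res
print(select_channel(TcmRequest("write", 64, None)))  # _write_res
```

Keeping the check next to the lock selection means the caller never needs to know which fields drive serialization.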
D5. Legacy Transaction forwarding path is preserved
When _worker receives a non-TcmRequest message, it dispatches to
_forward_txn, applying run()'s overhead_ns. The current standard PE
pipeline does not route Transactions through TCM, but the path is kept to
avoid breakage if fabric topology changes.
This path is accounted for via standard Transaction op_log; the BW channel locks are not acquired (orthogonal to D1's usage).
D6. PE_TCM is not a data store (timing only)
TCM models time only. The actual data payload is held by sim_engine's
memory_store (when present); the TCM component never updates it.
PE_FETCH_STORE obtains BW delay through TcmRequest, and register contents
are handled separately in the data path (ADR-0020 2-pass data execution —
Phase 2).
Alternatives Considered
A1. Single channel (capacity=2 for shared read+write)
Rejected. Would artificially serialize the normal-case overlap of fetch (read) and store (write) and yield an incorrect BW upper bound for the PE pipeline.
A2. capacity > 1 (e.g., 2-banked TCM)
Rejected. Current hardware model assumes a single bank. Multi-bank extension needs its own ADR that would supersede D1. Bumping capacity now would loosen the nominal serialization without raising the BW upper bound, producing less accurate modeling.
A3. Generalize BW formula to nbytes / bw + overhead_ns
Rejected. overhead_ns is reserved for the legacy forwarding path (D5).
Additional fetch/store-path overhead, if needed, belongs in PE_FETCH_STORE's
run() or in a register-file access model — closer to the responsibility
boundary.
Consequences
- TCM's BW accounting is locked at ADR level. Questions arising from op_log in GEMM/Math sweeps — "why did fetch and store overlap?", "why do only same-direction requests serialize?" — resolve quickly to D1.
- Future multi-bank TCM models or asymmetric read/write BW changes have a clear blast radius (D1 / D2 / D3 — pick one).
- D6 ("TCM is not a data store") sharpens the responsibility boundary with ADR-0020 2-pass execution.