# ADR-0040: PE_TCM Component Model — Dual-Channel BW Serialization

## Status

Accepted (2026-05-20).

ADR-0014 (PE Pipeline Execution Model, D1) references PE_TCM as a "BW-based
serialized scratchpad memory" but does not pin down the component's own model.
This ADR fills that gap.

## First action

When `start()` is invoked, immediately create two `simpy.Resource(env, capacity=1)`
instances and store them in `self._read_res` / `self._write_res`. These two
resources are the single decision points that serialize the **read channel**
and the **write channel** to one in-flight request each.

The runtime first action: `_worker` pulls a message off `_inbox` and branches
by type:

- `TcmRequest` (from `pe_fetch_store`): spawn
  `env.process(self._handle_tcm_request(...))`. Hence **TCM's first act is
  "acquire the lock matching the direction (read/write)"**. After lock
  acquisition, if `bw > 0 and nbytes > 0`, yield `env.timeout(delay_ns)` with
  `delay_ns = nbytes / bw`, then call `req.done.succeed()`.
- Anything else (Transaction): spawn `env.process(self._forward_txn(...))`
  (legacy fabric pass-through).

At construction, `node.attrs["read_bw_gbs"]` and `node.attrs["write_bw_gbs"]`
(default `512.0` GB/s each) are captured and held.

## Context

In the PE pipeline (ADR-0014 D1, D6), PE_TCM receives two kinds of traffic:

1. **`TcmRequest` from PE_FETCH_STORE**: when moving data between TCM and
   the register file, PE_FETCH_STORE sends a short sideband request to obtain
   BW-serialized access latency (`direction = "read"` or `"write"`, `nbytes`,
   `done` event).
2. **Legacy Transaction forwarding**: a fallback in case TCM ends up as a
   pass-through node on the fabric graph (not used by the current critical
   path, but preserved).

The problem: ADR-0014 only says "BW-based serialization" and leaves the
following unspecified:

- Read and write are **independent channels** running in parallel; only
  same-direction concurrency serializes at `capacity=1`.
- BW is split into two configurable values (`read_bw_gbs` / `write_bw_gbs`).
- The formula is `delay_ns = nbytes / bw_gbs` (loose unit convention:
  GB/s × ns ≈ B).
- `nbytes == 0` still acquires the lock but skips the BW term.
- `run()`'s `overhead_ns` (default `0.0`) is only used in the legacy fabric
  forwarding path.

Each of these points needs to be recorded in an ADR. In particular, "why are
read and write separate channels" and "who owns the BW values" must be
documented so that future changes (e.g., `capacity=2`) have a clear basis.

## Decision

### D1. Dual channel: read and write are independent resources

`_read_res = simpy.Resource(env, capacity=1)`,
`_write_res = simpy.Resource(env, capacity=1)`.
Same-direction concurrent requests queue on the resource and serialize;
opposite-direction requests proceed in parallel. This matches the hardware
model where TCM has a dual-port (read + write) configuration, and it allows
the simulator to express the GEMM-pipeline case where fetch (read) and store
(write) overlap in time: traffic is BW-serialized inside each direction but
independent across directions.

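The queueing behavior in D1 can be illustrated without SimPy. The following is a minimal analytic sketch, not the component's code: `Req` and `finish_times` are hypothetical names, and it assumes each channel serves requests first-come-first-served with the `nbytes / bw_gbs` delay from D2.

```python
from dataclasses import dataclass


@dataclass
class Req:
    """Hypothetical stand-in for a TCM access (not the real TcmRequest)."""
    arrival_ns: float
    direction: str  # "read" or "write"
    nbytes: int


def finish_times(reqs, read_bw_gbs=512.0, write_bw_gbs=512.0):
    """Return the finish time (ns) of each request, in input order.

    Each direction is modeled as its own capacity-1 FIFO channel, so
    same-direction requests serialize and opposite directions overlap.
    """
    bw = {"read": read_bw_gbs, "write": write_bw_gbs}
    free_at = {"read": 0.0, "write": 0.0}  # time each channel becomes idle
    out = [0.0] * len(reqs)
    # sorted() is stable, so equal arrival times keep their input order.
    for i in sorted(range(len(reqs)), key=lambda k: reqs[k].arrival_ns):
        r = reqs[i]
        start = max(r.arrival_ns, free_at[r.direction])
        rate = bw[r.direction]
        delay = r.nbytes / rate if r.nbytes > 0 and rate > 0 else 0.0
        free_at[r.direction] = start + delay
        out[i] = start + delay
    return out


reqs = [
    Req(0.0, "read", 4096),
    Req(0.0, "read", 4096),   # queues behind the first read
    Req(0.0, "write", 4096),  # overlaps with the reads
]
print(finish_times(reqs))  # [8.0, 16.0, 8.0]
```

The second read finishes at 16 ns because it waits for the first, while the write finishes at 8 ns alongside the first read.
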
### D2. Per-channel BW model: `nbytes / bw_gbs`

After lock acquisition, if `nbytes > 0 and bw > 0`, yield
`env.timeout(nbytes / bw_gbs)`. The unit convention is GB/s × ns ≈ B (exact
when 1 GB = 10^9 B), consistent with the simulator-wide loose convention
(see ADR-0033).

- `nbytes == 0`: BW term is zero, but the lock is acquired and released. This
  is intentional: when a plan generator emits an empty fetch/store on the
  PE_FETCH_STORE side, the op_log / channel accounting on the TCM side still
  records one consumption.
- `bw == 0` (config error): the timeout call is skipped (0-time pass). Should
  not occur with normal settings.

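As a worked check of the formula and its two guard cases, here is a small sketch. The function name `tcm_delay_ns` is illustrative and does not appear in the source; only the condition and the division come from D2.

```python
def tcm_delay_ns(nbytes: int, bw_gbs: float) -> float:
    """BW term in ns: bytes divided by GB/s (1 GB/s x 1 ns = 1 B).

    Returns 0.0 when the request is empty or the BW is misconfigured as 0,
    matching the "skip the timeout" behavior described in D2.
    """
    if nbytes > 0 and bw_gbs > 0:
        return nbytes / bw_gbs
    return 0.0


print(tcm_delay_ns(4096, 512.0))  # 8.0 ns at the default BW
print(tcm_delay_ns(1024, 256.0))  # 4.0 ns on a slower channel
print(tcm_delay_ns(0, 512.0))     # 0.0: empty request, lock still taken
print(tcm_delay_ns(4096, 0.0))    # 0.0: config error, timeout skipped
```
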
### D3. BW values come from `node.attrs.read_bw_gbs` / `write_bw_gbs`

Both default to `512.0` GB/s. The topology builder (`topology/builder.py`)
passes these attrs when instantiating TCM from `pe_template`. Default changes
should coincide with related decisions in ADR-0014 D1 or ADR-0033.

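The attr lookup with defaults might look like the sketch below. It assumes `node.attrs` behaves like a plain `dict`; the helper `channel_bw` and the constant `DEFAULT_BW_GBS` are hypothetical names introduced for illustration.

```python
DEFAULT_BW_GBS = 512.0  # default stated in D3


def channel_bw(attrs: dict) -> tuple[float, float]:
    """Return (read_bw_gbs, write_bw_gbs), falling back to the default."""
    read_bw = float(attrs.get("read_bw_gbs", DEFAULT_BW_GBS))
    write_bw = float(attrs.get("write_bw_gbs", DEFAULT_BW_GBS))
    return read_bw, write_bw


print(channel_bw({}))                       # (512.0, 512.0)
print(channel_bw({"write_bw_gbs": 256}))    # (512.0, 256.0)
```

Because the two values are read independently, an asymmetric configuration only needs to override one key.
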
### D4. TcmRequest schema is owned by PE_TCM

`@dataclass TcmRequest(direction: str, nbytes: int, done: simpy.Event, tag: str = "")`
lives in `components/builtin/pe_tcm.py`. PE_FETCH_STORE imports the dataclass
and only constructs/sends it. The caller does not define the schema because:

- The meaning of BW serialization is TCM's responsibility: TCM decides which
  fields drive serialization.
- The valid-value check for `direction` (must be `"read"` or `"write"`) lives
  in `_handle_tcm_request`'s if/else branch.

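A stand-in for the schema and the direction branch is sketched below. The field list follows the signature above, with `done` typed as `Any` so the example runs without SimPy. `select_channel` is a hypothetical helper, and raising `ValueError` on an unknown direction is an assumption: the source only says the check lives in the if/else branch, not how an invalid value is handled.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TcmRequest:
    direction: str  # "read" or "write"
    nbytes: int
    done: Any       # simpy.Event in the ADR; typed loosely in this sketch
    tag: str = ""


def select_channel(req: TcmRequest, read_res: Any, write_res: Any) -> Any:
    """Pick the resource that serializes this request's direction."""
    if req.direction == "read":
        return read_res
    if req.direction == "write":
        return write_res
    # Assumption: reject anything else loudly rather than guess a channel.
    raise ValueError(f"unknown TcmRequest direction: {req.direction!r}")


req = TcmRequest(direction="write", nbytes=4096, done=None, tag="store0")
print(select_channel(req, "READ_RES", "WRITE_RES"))  # WRITE_RES
```
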
### D5. Legacy Transaction forwarding path is preserved

When `_worker` receives a non-`TcmRequest` message, it dispatches to
`_forward_txn`, applying `run()`'s `overhead_ns`. The current standard PE
pipeline does not route Transactions through TCM, but the path is kept to
avoid breakage if fabric topology changes.

This path is accounted for via standard Transaction op_log; the BW channel
locks are **not** acquired (orthogonal to D1's usage).

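The type-based branch can be summarized in a few lines. This is an illustrative sketch only: both message classes are stand-ins, `route` is a hypothetical function, and treating `overhead_ns` as a fixed delay on the forwarding path is an assumption drawn from the wording above.

```python
from dataclasses import dataclass


@dataclass
class TcmRequest:   # stand-in for the real sideband request
    direction: str
    nbytes: int


@dataclass
class Transaction:  # hypothetical stand-in for the fabric message type
    txn_id: int


def route(msg, overhead_ns: float = 0.0):
    """Return (handler name, takes a channel lock?, fixed delay in ns)."""
    if isinstance(msg, TcmRequest):
        # Sideband path: BW delay is computed later, under the channel lock.
        return ("_handle_tcm_request", True, 0.0)
    # Everything else takes the legacy pass-through and skips the locks.
    return ("_forward_txn", False, overhead_ns)


print(route(TcmRequest("read", 4096)))        # ('_handle_tcm_request', True, 0.0)
print(route(Transaction(7), overhead_ns=2.0)) # ('_forward_txn', False, 2.0)
```
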
### D6. PE_TCM is not a data store (timing only)

TCM models **time only**. The actual data payload is held by sim_engine's
`memory_store` (when present); the TCM component never updates it.
PE_FETCH_STORE obtains BW delay through `TcmRequest`, and register contents
are handled separately in the data path (ADR-0020 2-pass data execution,
Phase 2).

## Alternatives Considered

### A1. Single channel (`capacity=2` for shared read+write)

Rejected. Would artificially serialize the normal-case overlap of fetch
(read) and store (write) and yield an incorrect BW upper bound for the PE
pipeline.

### A2. `capacity > 1` (e.g., 2-banked TCM)

Rejected. Current hardware model assumes a single bank. Multi-bank extension
needs its own ADR that would supersede D1. Bumping capacity now would loosen
the nominal serialization without raising the BW upper bound, producing less
accurate modeling.

### A3. Generalize BW formula to `nbytes / bw + overhead_ns`

Rejected. `overhead_ns` is reserved for the legacy forwarding path (D5).
Additional fetch/store-path overhead, if needed, belongs in PE_FETCH_STORE's
`run()` or in a register-file access model, which is closer to the
responsibility boundary.

## Consequences

- TCM's BW accounting is locked at ADR level. Questions arising from op_log
  in GEMM/Math sweeps ("why did fetch and store overlap?", "why do only
  same-direction requests serialize?") resolve quickly to D1.
- Future multi-bank TCM models or asymmetric read/write BW changes have a
  clear blast radius (one of D1 / D2 / D3).
- D6 ("TCM is not a data store") sharpens the responsibility boundary with
  ADR-0020 2-pass execution.