ADR: add 0038-0042 (pcie_ep, pe_mmu, pe_tcm, sram, tiling)
Fill component-model coverage gaps surfaced by /report's G4 analysis. Each ADR documents the component's First action, latency model, and honest notes on dormant code or implementation asymmetries discovered during re-evaluation against current code.

- 0038 pcie_ep: thin protocol-overhead model; ComponentBase forwarding worker as-is; named-node contract for router helpers
- 0039 pe_mmu: component + utility dual role; sub-page region stopgap; D2.1 flags pipeline path missing mmu.overhead_ns timeout (asymmetric with non-pipeline; not visible at default tlb_overhead_ns=0)
- 0040 pe_tcm: dual-channel BW serialization (read/write Resource cap=1); TcmRequest schema owned by TCM; timing-only (no data store)
- 0041 sram: terminal scratchpad model + ResponseMsg on reverse path; D1.1 flags _worker override as currently dormant (no Transaction actually targets the SRAM node today)
- 0042 tiling: pure plan-generator module, not a component; corrects the G4 misclassification; pins GEMM/Math stage sequences and epilogue scope contract

Also, /report skill G3 refinement: only flag older->newer asymmetric cross-references; newer->older (e.g., 0034-0037 citing infrastructure ADRs) are expected one-way and no longer reported.

Bilingual pair verifier (tools/verify_adr_lang_pairs.py) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,149 @@
# ADR-0040: PE_TCM Component Model — Dual-Channel BW Serialization

## Status

Accepted (2026-05-20).

ADR-0014 (PE Pipeline Execution Model, D1) references PE_TCM as a "BW-based
serialized scratchpad memory" but does not pin down the component's own model.
This ADR fills that gap.

## First action

When `start()` is invoked, immediately create two `simpy.Resource(env, capacity=1)`
instances and store them in `self._read_res` / `self._write_res`. These two
resources are the single decision points that serialize the **read channel**
and **write channel** to one in-flight request each.

At runtime, the first action is that `_worker` pulls a message off `_inbox`
and branches by type:

- `TcmRequest` (from `pe_fetch_store`): spawn `env.process(self._handle_tcm_request)`.
  Hence **TCM's first act is "acquire the lock matching the direction
  (read/write)"**. After lock acquisition, if `bw > 0 and nbytes > 0`, yield
  `env.timeout(delay_ns)` with `delay_ns = nbytes / bw`, then call
  `req.done.succeed()`.
- Anything else (Transaction): spawn `env.process(self._forward_txn)` (legacy
  fabric pass-through).

At construction, `node.attrs["read_bw_gbs"]` and `node.attrs["write_bw_gbs"]`
(default `512.0 GB/s` each) are captured and held.
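
The type-based branch in `_worker` can be illustrated with a minimal sketch. The two dataclasses below are simplified stand-ins (the `done` event is omitted and `Transaction` is a hypothetical placeholder), and `dispatch` is an illustrative helper, not a function in `pe_tcm.py`; it only returns the name of the handler that would be spawned.

```python
from dataclasses import dataclass


@dataclass
class TcmRequest:
    """Simplified stand-in for the sideband request (no `done` event here)."""
    direction: str
    nbytes: int
    tag: str = ""


@dataclass
class Transaction:
    """Hypothetical placeholder for a fabric Transaction."""
    payload: str = ""


def dispatch(msg: object) -> str:
    """Return the name of the handler `_worker` would spawn for `msg`."""
    if isinstance(msg, TcmRequest):
        return "_handle_tcm_request"  # BW-serialized sideband path
    return "_forward_txn"             # anything else: legacy pass-through


handlers = [dispatch(TcmRequest("read", 4096)), dispatch(Transaction())]
print(handlers)  # ['_handle_tcm_request', '_forward_txn']
```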

## Context

In the PE pipeline (ADR-0014 D1, D6), PE_TCM receives two kinds of traffic:

1. **`TcmRequest` from PE_FETCH_STORE**: when moving data between TCM and
   the register file, PE_FETCH_STORE sends a short sideband request to obtain
   BW-serialized access latency (`direction = "read"` or `"write"`, `nbytes`,
   `done` event).
2. **Legacy Transaction forwarding**: a fallback in case TCM ends up as a
   pass-through node on the fabric graph (not used by the current critical
   path, but preserved).

The problem: ADR-0014 only says "BW-based serialization" and leaves the
following unspecified:

- Read and write are **independent channels** running in parallel; only
  same-direction concurrency serializes at `capacity=1`.
- BW is split into two configurable values (`read_bw_gbs` / `write_bw_gbs`).
- The formula is `delay_ns = nbytes / bw_gbs` (loose unit convention:
  GB/s × ns ≈ B).
- `nbytes == 0` still acquires the lock but skips the BW term.
- `run()`'s `overhead_ns` (default `0.0`) is only used in the legacy fabric
  forwarding path.

Each of these points needs to be recorded in an ADR. In particular, "why are
read and write separate channels" and "who owns the BW values" must be
documented so that future changes (e.g., `capacity=2`) have a clear basis.

## Decision

### D1. Dual channel: read and write are independent resources

`_read_res = simpy.Resource(env, capacity=1)`,
`_write_res = simpy.Resource(env, capacity=1)`.
Same-direction concurrent requests queue on the resource and serialize;
opposite-direction requests proceed in parallel. This matches the hardware
model where TCM has a dual-port (read + write) configuration, and it allows
the simulator to express the GEMM-pipeline case where fetch (read) and store
(write) overlap in time: each direction is BW-serialized internally, but the
two directions are independent of each other.
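
The effect of D1 on completion times can be sketched without SimPy. The function below is an analytic stand-in, not the component's code: it assumes all requests arrive at `t = 0`, are granted in FIFO order per direction, and that both bandwidths are positive. `finish_times` is a hypothetical name.

```python
def finish_times(requests, read_bw_gbs=512.0, write_bw_gbs=512.0):
    """Completion time (ns) of each request when all arrive at t=0.

    Each direction is a FIFO channel with one in-flight request, so a
    request starts when the previous same-direction request finishes.
    """
    bw = {"read": read_bw_gbs, "write": write_bw_gbs}
    free_at = {"read": 0.0, "write": 0.0}  # when each channel is next idle
    out = []
    for direction, nbytes in requests:
        free_at[direction] += nbytes / bw[direction]
        out.append(free_at[direction])
    return out


times = finish_times([("read", 4096), ("read", 2048), ("write", 4096)])
print(times)  # [8.0, 12.0, 8.0]
```

The two reads serialize (8 ns, then 12 ns), while the write overlaps with them and also completes at 8 ns.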

### D2. Per-channel BW model: `nbytes / bw_gbs`

After lock acquisition, if `nbytes > 0 and bw > 0`, yield
`env.timeout(nbytes / bw_gbs)`. The unit convention is GB/s × ns ≈ B,
consistent with the simulator-wide loose convention (see ADR-0033).

- `nbytes == 0`: BW term is zero, but the lock is acquired and released. This
  is intentional: when a plan generator emits an empty fetch/store on the
  PE_FETCH_STORE side, the op_log / channel accounting on the TCM side still
  records one consumption.
- `bw == 0` (config error): the timeout call is skipped, so the request
  passes in zero simulated time. This should not occur with normal settings.
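
As a worked example of the BW term, the sketch below computes the delay that would be passed to `env.timeout`. `tcm_delay_ns` is a hypothetical helper name; it treats 1 GB/s as 1 B/ns (decimal GB), which is the reading under which the quotient is directly in nanoseconds.

```python
def tcm_delay_ns(nbytes: int, bw_gbs: float) -> float:
    """BW term in ns; 0.0 when either guard (nbytes > 0, bw > 0) fails."""
    if nbytes > 0 and bw_gbs > 0:
        return nbytes / bw_gbs  # bytes / (bytes per ns) -> ns
    return 0.0


print(tcm_delay_ns(4096, 512.0))  # 8.0
print(tcm_delay_ns(0, 512.0))     # 0.0 (empty request, lock still taken)
print(tcm_delay_ns(4096, 0.0))    # 0.0 (config error, timeout skipped)
```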

### D3. BW values come from `node.attrs.read_bw_gbs` / `write_bw_gbs`

Both default to `512.0 GB/s`. The topology builder (`topology/builder.py`)
passes these attrs when instantiating TCM from `pe_template`. Any change to
the defaults should be coordinated with the related decisions in ADR-0014 D1
or ADR-0033.
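
A minimal sketch of capturing the two attrs with their defaults, assuming `attrs` behaves like a plain dict; `capture_bw` and `DEFAULT_BW_GBS` are hypothetical names used only for illustration.

```python
DEFAULT_BW_GBS = 512.0  # default for both channels, per D3


def capture_bw(attrs: dict) -> tuple:
    """Return (read_bw_gbs, write_bw_gbs), falling back to the default."""
    read_bw = float(attrs.get("read_bw_gbs", DEFAULT_BW_GBS))
    write_bw = float(attrs.get("write_bw_gbs", DEFAULT_BW_GBS))
    return read_bw, write_bw


print(capture_bw({}))                       # (512.0, 512.0)
print(capture_bw({"write_bw_gbs": 256}))    # (512.0, 256.0)
```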

### D4. TcmRequest schema is owned by PE_TCM

`@dataclass TcmRequest(direction: str, nbytes: int, done: simpy.Event, tag: str = "")`
lives in `components/builtin/pe_tcm.py`. PE_FETCH_STORE imports the dataclass
and only constructs/sends it. The caller does not define the schema because:

- The meaning of BW serialization is TCM's responsibility: TCM decides which
  fields drive serialization.
- The valid-value check for `direction` (must be `"read"` or `"write"`) lives
  in `_handle_tcm_request`'s if/else branch.
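
The schema and the direction check might look roughly like the sketch below. Two things are assumptions: `done` is typed as `Any` so the example runs without SimPy (the real field is a `simpy.Event`), and raising `ValueError` on an unknown direction is illustrative only, since this ADR does not state how `_handle_tcm_request` treats invalid values. `channel_for` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TcmRequest:
    direction: str   # "read" or "write"
    nbytes: int
    done: Any        # a simpy.Event in the real component
    tag: str = ""


def channel_for(req: TcmRequest) -> str:
    """Name of the resource that serializes this request."""
    if req.direction == "read":
        return "_read_res"
    if req.direction == "write":
        return "_write_res"
    # Assumed behavior for illustration; not confirmed by the ADR.
    raise ValueError(f"invalid direction: {req.direction!r}")


print(channel_for(TcmRequest("read", 4096, done=None)))   # _read_res
print(channel_for(TcmRequest("write", 4096, done=None)))  # _write_res
```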

### D5. Legacy Transaction forwarding path is preserved

When `_worker` receives a non-`TcmRequest` message, it dispatches to
`_forward_txn`, applying `run()`'s `overhead_ns`. The current standard PE
pipeline does not route Transactions through TCM, but the path is kept to
avoid breakage if fabric topology changes.

This path is accounted for via standard Transaction op_log; the BW channel
locks are **not** acquired (orthogonal to D1's usage).

### D6. PE_TCM is not a data store (timing only)

TCM models **time only**. The actual data payload is held by sim_engine's
`memory_store` (when present); the TCM component never updates it.
PE_FETCH_STORE obtains BW delay through `TcmRequest`, and register contents
are handled separately in the data path (ADR-0020 2-pass data execution,
Phase 2).

## Alternatives Considered

### A1. Single channel (`capacity=2` for shared read+write)

Rejected. Would artificially serialize the normal-case overlap of fetch
(read) and store (write) and yield an incorrect BW upper bound for the PE
pipeline.

### A2. `capacity > 1` (e.g., 2-banked TCM)

Rejected. The current hardware model assumes a single bank. A multi-bank
extension needs its own ADR that would supersede D1. Bumping capacity now
would loosen the nominal serialization without raising the BW upper bound,
producing less accurate modeling.
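
The A2 concern can be made concrete with a small arithmetic sketch. It is an analytic stand-in under stated assumptions (all same-direction requests arrive at `t = 0`, slots are granted in order, each holder waits `nbytes / bw_gbs`), and `makespan_ns` is a hypothetical name.

```python
import heapq


def makespan_ns(sizes, bw_gbs, capacity):
    """Time to drain same-direction requests that all arrive at t=0."""
    slots = [0.0] * capacity  # time at which each slot becomes free
    heapq.heapify(slots)
    end = 0.0
    for nbytes in sizes:
        start = heapq.heappop(slots)       # earliest free slot
        finish = start + nbytes / bw_gbs   # per-request BW term (D2)
        heapq.heappush(slots, finish)
        end = max(end, finish)
    return end


sizes = [4096, 4096]
for cap in (1, 2):
    t = makespan_ns(sizes, 512.0, cap)
    print(cap, t, sum(sizes) / t)  # capacity, makespan ns, effective GB/s
# 1 16.0 512.0
# 2 8.0 1024.0
```

With `capacity=2` and an unchanged `512.0 GB/s` setting, the two requests overlap and the channel appears to deliver 1024 GB/s, which exceeds the configured bound.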

### A3. Generalize BW formula to `nbytes / bw + overhead_ns`

Rejected. `overhead_ns` is reserved for the legacy forwarding path (D5).
Additional fetch/store-path overhead, if needed, belongs in PE_FETCH_STORE's
`run()` or in a register-file access model, which sits closer to the
responsibility boundary.

## Consequences

- TCM's BW accounting is locked at ADR level. Questions arising from op_log
  in GEMM/Math sweeps, such as "why did fetch and store overlap?" and "why do
  only same-direction requests serialize?", resolve quickly to D1.
- Future multi-bank TCM models or asymmetric read/write BW changes have a
  clear blast radius (one of D1 / D2 / D3).
- D6 ("TCM is not a data store") sharpens the responsibility boundary with
  ADR-0020 2-pass execution.