kernbench2/docs/adr/ADR-0040-dev-pe-tcm-component-model.md
ywkang 1f36baa898 ADR: add 0038-0042 (pcie_ep, pe_mmu, pe_tcm, sram, tiling)
Fill component-model coverage gaps surfaced by /report's G4 analysis.
Each ADR documents the component's First action, latency model, and
honest notes on dormant code or implementation asymmetries discovered
during re-evaluation against current code.

- 0038 pcie_ep: thin protocol-overhead model; ComponentBase forwarding
  worker as-is; named-node contract for router helpers
- 0039 pe_mmu: component + utility dual role; sub-page region stopgap;
  D2.1 flags pipeline path missing mmu.overhead_ns timeout (asymmetric
  with non-pipeline; not visible at default tlb_overhead_ns=0)
- 0040 pe_tcm: dual-channel BW serialization (read/write Resource cap=1);
  TcmRequest schema owned by TCM; timing-only (no data store)
- 0041 sram: terminal scratchpad model + ResponseMsg on reverse path;
  D1.1 flags _worker override as currently dormant (no Transaction
  actually targets the SRAM node today)
- 0042 tiling: pure plan-generator module, not a component; corrects
  the G4 misclassification; pins GEMM/Math stage sequences and
  epilogue scope contract

Also: /report skill G3 refinement — only flag older->newer asymmetric
cross-references; newer->older (e.g., 0034-0037 citing infrastructure
ADRs) are expected one-way and no longer reported.

Bilingual pair verifier (tools/verify_adr_lang_pairs.py) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:43:03 -07:00

# ADR-0040: PE_TCM Component Model — Dual-Channel BW Serialization
## Status
Accepted (2026-05-20).
ADR-0014 (PE Pipeline Execution Model, D1) references PE_TCM as a "BW-based
serialized scratchpad memory" but does not pin down the component's own model.
This ADR fills that gap.
## First action
When `start()` is invoked, immediately create two `simpy.Resource(env, capacity=1)`
instances and store them in `self._read_res` / `self._write_res`. These two
resources are the single decision points that serialize the **read channel**
and **write channel** to one in-flight request each.
At runtime, the first action is that `_worker` pulls a message off `_inbox` and
branches by type:
- `TcmRequest` (from `pe_fetch_store`): spawn `env.process(self._handle_tcm_request)`.
Hence **TCM's first act is "acquire the lock matching the direction
(read/write)"**. After lock acquisition, if `bw > 0 and nbytes > 0`, yield
`env.timeout(delay_ns)` with `delay_ns = nbytes / bw`, then call
`req.done.succeed()`.
- Anything else (Transaction): spawn `env.process(self._forward_txn)` (legacy
fabric pass-through).
At construction, `node.attrs["read_bw_gbs"]` and `node.attrs["write_bw_gbs"]`
(default `512.0 GB/s` each) are captured and held.
## Context
In the PE pipeline (ADR-0014 D1, D6), PE_TCM receives two kinds of traffic:
1. **`TcmRequest` from PE_FETCH_STORE** — when moving data between TCM and
the register file, PE_FETCH_STORE sends a short sideband request to obtain
BW-serialized access latency (`direction = "read"` or `"write"`, `nbytes`,
`done` event).
2. **Legacy Transaction forwarding** — a fallback in case TCM ends up as a
pass-through node on the fabric graph (not used by the current critical
path, but preserved).
The problem: ADR-0014 only says "BW-based serialization" without specifying:
- Read and write are **independent channels** running in parallel; only
same-direction concurrency serializes at `capacity=1`.
- BW is split into two configurable values (`read_bw_gbs` / `write_bw_gbs`).
- The formula is `delay_ns = nbytes / bw_gbs` (loose unit convention:
GB/s × ns ≈ B).
- `nbytes == 0` still acquires the lock but skips the BW term.
- `run()`'s `overhead_ns` (default `0.0`) is only used in the legacy fabric
forwarding path.
Each of these points needs to be pinned down in an ADR. In particular, "why
are read and write separate channels" and "who owns the BW values" must be
documented so that future changes (e.g., `capacity=2`) have a clear basis.
## Decision
### D1. Dual channel — read and write are independent resources
`_read_res = simpy.Resource(env, capacity=1)`,
`_write_res = simpy.Resource(env, capacity=1)`.
Same-direction concurrent requests queue on the resource and serialize;
opposite-direction requests proceed in parallel. This matches the hardware
model where TCM has a dual-port (read + write) configuration, and it allows
the simulator to express the GEMM-pipeline case where fetch (read) and store
(write) overlap in time — modeled as BW-serialized inside each direction but
independent across directions.
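
The overlap behaviour in D1 can be illustrated with a small analytic model. This is a sketch only: it does not use SimPy, and `Req` and `finish_times` are hypothetical names introduced here. It assumes FIFO service with one in-flight request per direction, which is what a `capacity=1` resource per channel would enforce.

```python
from dataclasses import dataclass


@dataclass
class Req:
    """Hypothetical stand-in for one TCM request arrival (not the real schema)."""
    arrival_ns: float
    direction: str  # "read" or "write"
    nbytes: int


def finish_times(reqs, read_bw_gbs=512.0, write_bw_gbs=512.0):
    """Return finish times in ns, in arrival order.

    Each direction is modeled as its own channel that serves one request
    at a time; the two channels never block each other.
    """
    bw = {"read": read_bw_gbs, "write": write_bw_gbs}
    free_at = {"read": 0.0, "write": 0.0}  # time at which each channel goes idle
    out = []
    # sorted() is stable, so requests with equal arrival keep their input order.
    for r in sorted(reqs, key=lambda r: r.arrival_ns):
        start = max(r.arrival_ns, free_at[r.direction])
        end = start + r.nbytes / bw[r.direction]
        free_at[r.direction] = end
        out.append(end)
    return out


# Two reads and one write, all arriving at t = 0 ns.
reqs = [Req(0.0, "read", 4096), Req(0.0, "read", 4096), Req(0.0, "write", 4096)]
times = finish_times(reqs)
print(times)  # [8.0, 16.0, 8.0]
```

The second read waits for the first (8 ns, then 16 ns), while the write finishes at 8 ns alongside the first read.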
### D2. Per-channel BW model — `nbytes / bw_gbs`
After lock acquisition, if `nbytes > 0 and bw > 0`, yield
`env.timeout(nbytes / bw_gbs)`. The unit convention is GB/s × ns ≈ B (that is,
1 GB/s is treated as 1 byte per ns), consistent with the simulator-wide loose
convention (see ADR-0033).
- `nbytes == 0`: BW term is zero, but the lock is acquired and released. This
is intentional: when a plan generator emits an empty fetch/store on the
PE_FETCH_STORE side, the op_log / channel accounting on the TCM side still
records one consumption.
- `bw == 0` (config error): the timeout call is skipped (0-time pass). Should
not occur with normal settings.
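
As a worked example of the D2 formula and its two edge cases, the BW term can be sketched as a plain function. The name `tcm_delay_ns` is hypothetical and not taken from the code base. The lock acquire/release around it is deliberately left out.

```python
def tcm_delay_ns(nbytes: int, bw_gbs: float) -> float:
    """BW term only, treating 1 GB/s as 1 byte per ns.

    Returns 0.0 when no timeout would be issued: an empty transfer
    (nbytes == 0) or a misconfigured channel (bw_gbs == 0).
    """
    if nbytes > 0 and bw_gbs > 0:
        return nbytes / bw_gbs
    return 0.0


print(tcm_delay_ns(4096, 512.0))  # 8.0  (4 KiB at the default 512 GB/s)
print(tcm_delay_ns(0, 512.0))     # 0.0  (empty request: lock only, no BW term)
print(tcm_delay_ns(4096, 0.0))    # 0.0  (bw == 0: timeout skipped)
```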
### D3. BW values come from `node.attrs.read_bw_gbs` / `write_bw_gbs`
Defaults `512.0 GB/s`. The topology builder (`topology/builder.py`) passes
these attrs when instantiating TCM from `pe_template`. Default changes should
coincide with related decisions in ADR-0014 D1 or ADR-0033.
### D4. TcmRequest schema is owned by PE_TCM
`@dataclass TcmRequest(direction: str, nbytes: int, done: simpy.Event, tag: str = "")`
lives in `components/builtin/pe_tcm.py`. PE_FETCH_STORE imports the dataclass
and only constructs/sends it. The caller does not define the schema because:
- The meaning of BW serialization is TCM's responsibility — TCM decides which
fields drive serialization.
- The valid-value check for `direction` (must be `"read"` or `"write"`) lives
in `_handle_tcm_request`'s if/else branch.
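
The schema in D4 can be written out as a minimal, stdlib-only sketch. The field list follows the signature above. `done` is typed as `Any` here only to avoid importing SimPy, and `select_channel` is a hypothetical helper. The ADR does not say how the real if/else branch treats an unknown `direction`; raising `ValueError` is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TcmRequest:
    direction: str  # "read" or "write"
    nbytes: int
    done: Any       # simpy.Event in the real component; Any keeps this sketch stdlib-only
    tag: str = ""


def select_channel(req: TcmRequest) -> str:
    """Hypothetical helper: name the channel whose lock the request would take."""
    if req.direction == "read":
        return "read"
    if req.direction == "write":
        return "write"
    # Assumption: reject anything else instead of silently picking a channel.
    raise ValueError(f"unknown direction: {req.direction!r}")


req = TcmRequest(direction="read", nbytes=4096, done=None, tag="fetch-A")
print(select_channel(req))  # read
```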
### D5. Legacy Transaction forwarding path is preserved
When `_worker` receives a non-`TcmRequest` message, it dispatches to
`_forward_txn`, applying `run()`'s `overhead_ns`. The current standard PE
pipeline does not route Transactions through TCM, but the path is kept to
avoid breakage if fabric topology changes.
This path is accounted for via standard Transaction op_log; the BW channel
locks are **not** acquired (orthogonal to D1's usage).
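
The split between the two paths can be summarized in a few lines. This is an illustrative sketch, not the component's code: the `TcmRequest` and `Transaction` classes below are bare stand-ins, and `route` is a hypothetical function that only reports which handler would run and which channel lock (if any) it would take.

```python
class TcmRequest:
    """Stand-in carrying only what the routing decision needs."""
    def __init__(self, direction: str, nbytes: int):
        self.direction = direction
        self.nbytes = nbytes


class Transaction:
    """Stand-in for any non-TcmRequest message."""


def route(msg):
    """Return (handler name, channel lock taken or None)."""
    if isinstance(msg, TcmRequest):
        # Sideband path: serialized on the matching BW channel (D1/D2).
        return "_handle_tcm_request", msg.direction
    # Legacy path: overhead_ns only, no BW channel lock.
    return "_forward_txn", None


print(route(TcmRequest("write", 128)))  # ('_handle_tcm_request', 'write')
print(route(Transaction()))             # ('_forward_txn', None)
```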
### D6. PE_TCM is not a data store (timing only)
TCM models **time only**. The actual data payload is held by sim_engine's
`memory_store` (when present); the TCM component never updates it.
PE_FETCH_STORE obtains BW delay through `TcmRequest`, and register contents
are handled separately in the data path (ADR-0020 2-pass data execution —
Phase 2).
## Alternatives Considered
### A1. Single channel (`capacity=1` for shared read+write)
Rejected. Would artificially serialize the normal-case overlap of fetch
(read) and store (write) and yield an incorrect BW upper bound for the PE
pipeline.
### A2. `capacity > 1` (e.g., 2-banked TCM)
Rejected. Current hardware model assumes a single bank. Multi-bank extension
needs its own ADR that would supersede D1. Bumping capacity now would loosen
the nominal serialization without raising the BW upper bound, producing less
accurate modeling.
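
One way to read the A2 concern is sketched below, under the assumption that each request still takes `nbytes / bw_gbs` regardless of capacity. `makespan_ns` is a hypothetical helper that models same-direction requests arriving together and served FIFO by `capacity` slots. With `capacity=2` the configured BW is unchanged, yet the observed aggregate throughput doubles.

```python
import heapq


def makespan_ns(sizes, bw_gbs=512.0, capacity=1):
    """Finish time of same-direction requests that all arrive at t = 0,
    served FIFO by `capacity` slots, each taking nbytes / bw_gbs."""
    slots = [0.0] * capacity  # min-heap of times at which each slot frees up
    heapq.heapify(slots)
    end = 0.0
    for nbytes in sizes:
        start = heapq.heappop(slots)       # earliest available slot
        finish = start + nbytes / bw_gbs
        heapq.heappush(slots, finish)
        end = max(end, finish)
    return end


sizes = [4096] * 4
t1 = makespan_ns(sizes, capacity=1)
t2 = makespan_ns(sizes, capacity=2)
print(t1, sum(sizes) / t1)  # 32.0 512.0   (matches the configured 512 GB/s)
print(t2, sum(sizes) / t2)  # 16.0 1024.0  (exceeds the configured 512 GB/s)
```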
### A3. Generalize BW formula to `nbytes / bw + overhead_ns`
Rejected. `overhead_ns` is reserved for the legacy forwarding path (D5).
Additional fetch/store-path overhead, if needed, belongs in PE_FETCH_STORE's
`run()` or in a register-file access model — closer to the responsibility
boundary.
## Consequences
- TCM's BW accounting is locked at ADR level. Questions arising from op_log
in GEMM/Math sweeps — "why did fetch and store overlap?", "why do only
same-direction requests serialize?" — resolve quickly to D1.
- Future multi-bank TCM models or asymmetric read/write BW changes have a
clear blast radius (D1 / D2 / D3 — pick one).
- D6 ("TCM is not a data store") sharpens the responsibility boundary with
ADR-0020 2-pass execution.