kernbench2/docs/adr/ADR-0040-dev-pe-tcm-component-model.md
ywkang 1f36baa898 ADR: add 0038-0042 (pcie_ep, pe_mmu, pe_tcm, sram, tiling)
Fill component-model coverage gaps surfaced by /report's G4 analysis.
Each ADR documents the component's First action, latency model, and
honest notes on dormant code or implementation asymmetries discovered
during re-evaluation against current code.

- 0038 pcie_ep: thin protocol-overhead model; ComponentBase forwarding
  worker as-is; named-node contract for router helpers
- 0039 pe_mmu: component + utility dual role; sub-page region stopgap;
  D2.1 flags pipeline path missing mmu.overhead_ns timeout (asymmetric
  with non-pipeline; not visible at default tlb_overhead_ns=0)
- 0040 pe_tcm: dual-channel BW serialization (read/write Resource cap=1);
  TcmRequest schema owned by TCM; timing-only (no data store)
- 0041 sram: terminal scratchpad model + ResponseMsg on reverse path;
  D1.1 flags _worker override as currently dormant (no Transaction
  actually targets the SRAM node today)
- 0042 tiling: pure plan-generator module, not a component; corrects
  the G4 misclassification; pins GEMM/Math stage sequences and
  epilogue scope contract

Also: /report skill G3 refinement — only flag older->newer asymmetric
cross-references; newer->older (e.g., 0034-0037 citing infrastructure
ADRs) are expected one-way and no longer reported.

Bilingual pair verifier (tools/verify_adr_lang_pairs.py) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:43:03 -07:00

ADR-0040: PE_TCM Component Model — Dual-Channel BW Serialization

Status

Accepted (2026-05-20).

ADR-0014 (PE Pipeline Execution Model, D1) references PE_TCM as a "BW-based serialized scratchpad memory" but does not pin down the component's own model. This ADR fills that gap.

First action

When start() is invoked, immediately create two simpy.Resource(env, capacity=1) instances and store them in self._read_res / self._write_res. These two resources are the single decision points that serialize the read channel and write channel to one in-flight request each.

At runtime, the first action happens in _worker, which pulls a message off _inbox and branches by type:

  • TcmRequest (from pe_fetch_store): spawn env.process(self._handle_tcm_request). TCM's first act for such a request is therefore to acquire the lock matching its direction (read or write). Once the lock is held, if bw > 0 and nbytes > 0, the handler yields env.timeout(nbytes / bw) and then calls req.done.succeed().
  • Anything else (Transaction): spawn env.process(self._forward_txn) (legacy fabric pass-through).

At construction, node.attrs["read_bw_gbs"] and node.attrs["write_bw_gbs"] (default 512.0 GB/s each) are captured and held.

Context

In the PE pipeline (ADR-0014 D1, D6), PE_TCM receives two kinds of traffic:

  1. TcmRequest from PE_FETCH_STORE — when moving data between TCM and the register file, PE_FETCH_STORE sends a short sideband request to obtain BW-serialized access latency (direction = "read" or "write", nbytes, done event).
  2. Legacy Transaction forwarding — a fallback in case TCM ends up as a pass-through node on the fabric graph (not used by the current critical path, but preserved).

The problem: ADR-0014 only says "BW-based serialization" without specifying:

  • Read and write are independent channels running in parallel; only same-direction concurrency serializes at capacity=1.
  • BW is split into two configurable values (read_bw_gbs / write_bw_gbs).
  • The formula is delay_ns = nbytes / bw_gbs. Since 1 GB/s is 1 B/ns in decimal units, bytes divided by GB/s gives nanoseconds (the simulator's loose unit convention: GB/s × ns ≈ B).
  • nbytes == 0 still acquires the lock but skips the BW term.
  • run()'s overhead_ns (default 0.0) is only used in the legacy fabric forwarding path.

Each of these needs to be pinned down at ADR level. In particular, "why are read and write separate channels" and "who owns the BW values" must be documented so that future changes (e.g., capacity=2) have a clear basis.

Decision

D1. Dual channel — read and write are independent resources

_read_res = simpy.Resource(env, capacity=1), _write_res = simpy.Resource(env, capacity=1). Same-direction concurrent requests queue on the resource and serialize; opposite-direction requests proceed in parallel. This matches the hardware model where TCM has a dual-port (read + write) configuration, and it allows the simulator to express the GEMM-pipeline case where fetch (read) and store (write) overlap in time — modeled as BW-serialized inside each direction but independent across directions.
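The effect of D1 on a request timeline can be approximated without running the simulator. The helper below is a hypothetical closed-form model, assuming FIFO queueing on each capacity-1 channel; channel_timeline and its tuple layout are illustrative names, not part of the codebase.

```python
def channel_timeline(requests, read_bw_gbs=512.0, write_bw_gbs=512.0):
    """Return {tag: (start_ns, end_ns)} under per-direction FIFO serialization.

    requests: iterable of (tag, direction, arrival_ns, nbytes) tuples.
    """
    bw = {"read": read_bw_gbs, "write": write_bw_gbs}
    free_at = {"read": 0.0, "write": 0.0}  # when each channel is next idle
    timeline = {}
    # sorted() is stable, so same-time arrivals keep their input order.
    for tag, direction, arrival_ns, nbytes in sorted(requests, key=lambda r: r[2]):
        start = max(arrival_ns, free_at[direction])
        rate = bw[direction]
        busy = nbytes / rate if nbytes > 0 and rate > 0 else 0.0
        free_at[direction] = start + busy
        timeline[tag] = (start, start + busy)
    return timeline


demo = channel_timeline([
    ("fetch0", "read", 0.0, 4096),
    ("fetch1", "read", 0.0, 4096),
    ("store0", "write", 0.0, 2048),
])
```

In this sketch the second fetch waits for the first because both use the read channel, while the store starts at time zero on the write channel.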

D2. Per-channel BW model — nbytes / bw_gbs

After lock acquisition, if nbytes > 0 and bw > 0, yield env.timeout(nbytes / bw_gbs). The unit convention is GB/s × ns ≈ B, consistent with the simulator-wide loose convention (see ADR-0033).

  • nbytes == 0: BW term is zero, but the lock is acquired and released. This is intentional: when a plan generator emits an empty fetch/store on the PE_FETCH_STORE side, the op_log / channel accounting on the TCM side still records one consumption.
  • bw == 0 (config error): the timeout call is skipped (0-time pass). Should not occur with normal settings.
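The BW term and its two edge cases fit in one small function. This is a sketch of the rule as stated in D2; the name tcm_delay_ns is hypothetical. The unit argument holds exactly for decimal gigabytes (1 GB/s = 10^9 B per 10^9 ns = 1 B/ns); if a value were meant as GiB/s the result would be off by about 7.4%, which may be part of why the convention is called loose.

```python
def tcm_delay_ns(nbytes: int, bw_gbs: float) -> float:
    """BW term applied after the channel lock is held.

    1 GB/s == 1 B/ns (decimal units), so bytes / (GB/s) is already in ns.
    """
    if nbytes > 0 and bw_gbs > 0:
        return nbytes / bw_gbs
    # nbytes == 0: lock is still taken, but no time passes.
    # bw_gbs == 0: treated as a config error; timeout is skipped.
    return 0.0


full_tile_ns = tcm_delay_ns(4096, 512.0)   # 4 KiB at the default BW
empty_ns = tcm_delay_ns(0, 512.0)          # empty fetch/store
zero_bw_ns = tcm_delay_ns(4096, 0.0)       # misconfigured channel
```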

D3. BW values come from node.attrs.read_bw_gbs / write_bw_gbs

Both values default to 512.0 GB/s. The topology builder (topology/builder.py) passes these attrs when instantiating TCM from pe_template. Any change to the defaults should be coordinated with the related decisions in ADR-0014 D1 or ADR-0033.
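As a small illustration of D3, the lookup with defaults might look like the following. The helper name capture_tcm_bw and the plain-dict stand-in for node.attrs are assumptions; only the attribute keys and the 512.0 default come from this ADR.

```python
DEFAULT_TCM_BW_GBS = 512.0  # default for both channels per D3


def capture_tcm_bw(attrs):
    """Return (read_bw_gbs, write_bw_gbs), falling back to the default."""
    read_bw = float(attrs.get("read_bw_gbs", DEFAULT_TCM_BW_GBS))
    write_bw = float(attrs.get("write_bw_gbs", DEFAULT_TCM_BW_GBS))
    return read_bw, write_bw


symmetric = capture_tcm_bw({})                       # no overrides
asymmetric = capture_tcm_bw({"write_bw_gbs": 256})   # slower write channel
```

Keeping the two keys separate is what lets a topology express asymmetric read/write bandwidth without touching the channel logic.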

D4. TcmRequest schema is owned by PE_TCM

@dataclass TcmRequest(direction: str, nbytes: int, done: simpy.Event, tag: str = "") lives in components/builtin/pe_tcm.py. PE_FETCH_STORE imports the dataclass and only constructs/sends it. The caller does not define the schema because:

  • The meaning of BW serialization is TCM's responsibility — TCM decides which fields drive serialization.
  • The valid-value check for direction (must be "read" or "write") lives in _handle_tcm_request's if/else branch.
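To make the ownership argument concrete, here is a sketch of the schema next to a direction branch. The field list follows the dataclass quoted in D4; done is typed loosely here only so the example has no SimPy dependency. The select_channel helper and its ValueError on unknown directions are assumptions: the ADR says the check lives in the if/else branch but does not say how an invalid value is reported.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TcmRequest:
    direction: str   # "read" or "write"
    nbytes: int
    done: Any        # a simpy.Event in the source; loosened for this sketch
    tag: str = ""


def select_channel(req: TcmRequest) -> str:
    """Hypothetical strict form of the direction branch owned by TCM."""
    if req.direction == "read":
        return "_read_res"
    elif req.direction == "write":
        return "_write_res"
    # Assumption: reject anything else instead of defaulting to a channel.
    raise ValueError(f"unknown TcmRequest direction: {req.direction!r}")


read_channel = select_channel(TcmRequest("read", 64, None, tag="fetch-a"))
write_channel = select_channel(TcmRequest("write", 64, None))
```

Because both the fields and the branch sit on the TCM side, a caller such as PE_FETCH_STORE only needs to import the dataclass and fill it in.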

D5. Legacy Transaction forwarding path is preserved

When _worker receives a non-TcmRequest message, it dispatches to _forward_txn, applying run()'s overhead_ns. The current standard PE pipeline does not route Transactions through TCM, but the path is kept to avoid breakage if fabric topology changes.

This path is accounted for via standard Transaction op_log; the BW channel locks are not acquired (orthogonal to D1's usage).

D6. PE_TCM is not a data store (timing only)

TCM models time only. The actual data payload is held by sim_engine's memory_store (when present); the TCM component never updates it. PE_FETCH_STORE obtains BW delay through TcmRequest, and register contents are handled separately in the data path (ADR-0020 2-pass data execution — Phase 2).

Alternatives Considered

A1. Single channel (capacity=2 for shared read+write)

Rejected. Would artificially serialize the normal-case overlap of fetch (read) and store (write) and yield an incorrect BW upper bound for the PE pipeline.
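A rough comparison shows both failure modes. The sketch below reads A1 as one pool of two slots shared by reads and writes, which is an interpretation of the heading rather than something the ADR spells out; finish_times is a hypothetical helper that assumes all requests arrive at time zero and are served FIFO.

```python
import heapq


def finish_times(requests, capacity, bw_gbs=512.0):
    """Finish time (ns) per tag on one pool of `capacity` slots.

    requests: list of (tag, nbytes), all arriving at t = 0, served FIFO.
    """
    slots = [0.0] * capacity  # time at which each slot becomes free
    heapq.heapify(slots)
    done = {}
    for tag, nbytes in requests:
        start = heapq.heappop(slots)      # earliest free slot
        end = start + nbytes / bw_gbs
        heapq.heappush(slots, end)
        done[tag] = end
    return done


# A1 as read here: reads and writes share one capacity-2 pool.
shared = finish_times([("r0", 4096), ("r1", 4096), ("w0", 4096)], capacity=2)

# D1: one capacity-1 pool per direction.
dual = {**finish_times([("r0", 4096), ("r1", 4096)], capacity=1),
        **finish_times([("w0", 4096)], capacity=1)}
```

Under these assumptions the shared pool lets both reads finish together, which doubles the effective read bandwidth, and pushes the write behind them. The dual-channel model keeps the reads serialized and lets the write overlap the first read.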

A2. capacity > 1 (e.g., 2-banked TCM)

Rejected. Current hardware model assumes a single bank. Multi-bank extension needs its own ADR that would supersede D1. Bumping capacity now would loosen the nominal serialization without raising the BW upper bound, producing less accurate modeling.

A3. Generalize BW formula to nbytes / bw + overhead_ns

Rejected. overhead_ns is reserved for the legacy forwarding path (D5). Additional fetch/store-path overhead, if needed, belongs in PE_FETCH_STORE's run() or in a register-file access model — closer to the responsibility boundary.

Consequences

  • TCM's BW accounting is locked at ADR level. Questions arising from op_log in GEMM/Math sweeps — "why did fetch and store overlap?", "why do only same-direction requests serialize?" — resolve quickly to D1.
  • Future multi-bank TCM models or asymmetric read/write BW changes have a clear blast radius (D1 / D2 / D3 — pick one).
  • D6 ("TCM is not a data store") sharpens the responsibility boundary with ADR-0020 2-pass execution.