
# ADR-0040: PE_TCM Component Model — Dual-Channel BW Serialization

## Status

Accepted (2026-05-20).

ADR-0014 (PE Pipeline Execution Model, D1) references PE_TCM as a "BW-based
serialized scratchpad memory" but does not pin down the component's own model.
This ADR fills that gap.

## First action

When `start()` is invoked, immediately create two `simpy.Resource(env, capacity=1)`
instances and store them in `self._read_res` / `self._write_res`. These two
resources are the single decision points that serialize the **read channel**
and **write channel** to one in-flight request each.

At runtime, `_worker` pulls a message off `_inbox` and branches by type:

- `TcmRequest` (from `pe_fetch_store`): spawn `env.process(self._handle_tcm_request)`.
  The handler begins by acquiring a lock, so **TCM's first act is "acquire the
  lock matching the direction (read/write)"**. Once the lock is held, if
  `bw > 0 and nbytes > 0`, it yields `env.timeout(delay_ns)` with
  `delay_ns = nbytes / bw`, then calls `req.done.succeed()`.
- Anything else (Transaction): spawn `env.process(self._forward_txn)` (legacy
  fabric pass-through).
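
The type-based branch in `_worker` can be sketched without SimPy as a small routing function. This is only an illustration under stated assumptions, not the code in `pe_tcm.py`: `route` is a hypothetical helper, `done` is typed as `Any` in place of `simpy.Event`, and the returned strings merely name the handler that `_worker` would spawn.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TcmRequest:
    """Stand-in for the sideband request described in this ADR."""
    direction: str   # "read" or "write"
    nbytes: int
    done: Any        # a simpy.Event in the simulator; untyped in this sketch
    tag: str = ""


def route(msg: object) -> str:
    """Hypothetical helper: name the handler `_worker` would spawn for `msg`."""
    if isinstance(msg, TcmRequest):
        return "_handle_tcm_request"   # BW-serialized sideband path
    return "_forward_txn"              # legacy fabric pass-through


print(route(TcmRequest("read", 4096, done=None)))  # _handle_tcm_request
print(route(object()))                             # _forward_txn
```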

At construction, `node.attrs["read_bw_gbs"]` and `node.attrs["write_bw_gbs"]`
(default `512.0 GB/s` each) are captured and held.

## Context

In the PE pipeline (ADR-0014 D1, D6), PE_TCM receives two kinds of traffic:

1. **`TcmRequest` from PE_FETCH_STORE** — when moving data between TCM and
   the register file, PE_FETCH_STORE sends a short sideband request to obtain
   BW-serialized access latency (`direction = "read"` or `"write"`, `nbytes`,
   `done` event).
2. **Legacy Transaction forwarding** — a fallback in case TCM ends up as a
   pass-through node on the fabric graph (not used by the current critical
   path, but preserved).

The problem: ADR-0014 only says "BW-based serialization" without specifying:

- Read and write are **independent channels** running in parallel; only
  same-direction concurrency serializes at `capacity=1`.
- BW is split into two configurable values (`read_bw_gbs` / `write_bw_gbs`).
- The formula is `delay_ns = nbytes / bw_gbs` (loose unit convention:
  GB/s × ns ≈ B; the relation is exact when GB means 10^9 bytes, since
  1 GB/s = 1 B/ns).
- `nbytes == 0` still acquires the lock but skips the BW term.
- `run()`'s `overhead_ns` (default `0.0`) is only used in the legacy fabric
  forwarding path.

Each of these points needs to be recorded in an ADR. In particular, "why are
read and write separate channels" and "who owns the BW values" must be
documented so that future changes (e.g., `capacity=2`) have a clear basis.

## Decision

### D1. Dual channel — read and write are independent resources

`_read_res = simpy.Resource(env, capacity=1)`,
`_write_res = simpy.Resource(env, capacity=1)`.
Concurrent requests in the same direction queue on that direction's resource
and serialize; requests in opposite directions proceed in parallel. This
matches the hardware model, in which TCM has a dual-port (read + write)
configuration. It also lets the simulator express the GEMM-pipeline case where
fetch (read) and store (write) overlap in time: each direction is
BW-serialized internally, and the two directions are independent of each other.
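
The queueing outcome of D1 can be reproduced with a plain-Python timeline calculation. This is a sketch under assumptions, not the simulator's implementation: `completion_times` is a hypothetical helper, requests are assumed to be listed in arrival order, and each direction is treated as a first-come, first-served channel that holds one request at a time.

```python
def completion_times(requests, bw_gbs=512.0):
    """Return the finish time (ns) of each request, in input order.

    requests: list of (arrival_ns, direction, nbytes), sorted by arrival.
    Each direction is modeled as its own one-at-a-time channel.
    """
    free_at = {"read": 0.0, "write": 0.0}  # when each channel next becomes free
    finish = []
    for arrival_ns, direction, nbytes in requests:
        # Wait only for an earlier request in the SAME direction.
        start = max(arrival_ns, free_at[direction])
        end = start + nbytes / bw_gbs
        free_at[direction] = end
        finish.append(end)
    return finish


reqs = [
    (0.0, "read", 4096),   # first fetch
    (0.0, "read", 4096),   # second fetch queues behind the first
    (0.0, "write", 4096),  # store overlaps with the fetches
]
print(completion_times(reqs))  # [8.0, 16.0, 8.0]
```

The two reads finish at 8 ns and 16 ns because they share one channel, while the write finishes at 8 ns because it never waits on the read channel.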

### D2. Per-channel BW model — `nbytes / bw_gbs`

After lock acquisition, if `nbytes > 0 and bw_gbs > 0`, yield
`env.timeout(nbytes / bw_gbs)`. The unit convention is GB/s × ns ≈ B,
consistent with the simulator-wide loose convention (see ADR-0033).

- `nbytes == 0`: BW term is zero, but the lock is acquired and released. This
  is intentional: when a plan generator emits an empty fetch/store on the
  PE_FETCH_STORE side, the op_log / channel accounting on the TCM side still
  records one consumption.
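- `bw == 0` (config error): the `bw > 0` guard fails, so the timeout call is
  skipped and the request completes in zero simulated time (0-time pass). The
  guard also avoids a division by zero. Should not occur with normal settings.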
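
The BW term and its two edge cases can be written as one small function. This is a minimal sketch: `tcm_delay_ns` is a hypothetical name, and lock acquisition, which happens regardless of the result, is deliberately left out.

```python
def tcm_delay_ns(nbytes: int, bw_gbs: float) -> float:
    """BW term of D2 in ns; 0.0 when the timeout would be skipped."""
    if nbytes > 0 and bw_gbs > 0:
        # 1 GB/s equals 1 B/ns for decimal GB, so bytes / (GB/s) gives ns.
        return nbytes / bw_gbs
    return 0.0  # empty request or misconfigured BW: no timeout is yielded


print(tcm_delay_ns(4096, 512.0))  # 8.0
print(tcm_delay_ns(0, 512.0))     # 0.0
print(tcm_delay_ns(4096, 0.0))    # 0.0
```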

### D3. BW values come from `node.attrs.read_bw_gbs` / `write_bw_gbs`

Both values default to `512.0 GB/s`. The topology builder
(`topology/builder.py`) passes these attrs when instantiating TCM from
`pe_template`. Changes to the defaults should be made together with the
related decisions in ADR-0014 D1 or ADR-0033.
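
Capturing the two BW values with their defaults might look like the following. This is a sketch under assumptions: `capture_bw` is a hypothetical helper, `attrs` is assumed to behave like a mapping, and the use of `.get` with a fallback is a guess at how the default is applied.

```python
from typing import Mapping, Tuple

DEFAULT_BW_GBS = 512.0  # default stated in D3, applied to both channels


def capture_bw(attrs: Mapping[str, float]) -> Tuple[float, float]:
    """Return (read_bw_gbs, write_bw_gbs), using the default for missing keys."""
    read_bw = float(attrs.get("read_bw_gbs", DEFAULT_BW_GBS))
    write_bw = float(attrs.get("write_bw_gbs", DEFAULT_BW_GBS))
    return read_bw, write_bw


print(capture_bw({}))                      # (512.0, 512.0)
print(capture_bw({"write_bw_gbs": 256}))   # (512.0, 256.0)
```

Keeping the two keys separate is what allows asymmetric read/write bandwidth to be configured without touching the channel model in D1.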

### D4. TcmRequest schema is owned by PE_TCM

`@dataclass TcmRequest(direction: str, nbytes: int, done: simpy.Event, tag: str = "")`
lives in `components/builtin/pe_tcm.py`. PE_FETCH_STORE imports the dataclass
and only constructs/sends it. The caller does not define the schema because:

- The meaning of BW serialization is TCM's responsibility — TCM decides which
  fields drive serialization.
- The valid-value check for `direction` (must be `"read"` or `"write"`) lives
  in `_handle_tcm_request`'s if/else branch.
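
One way the direction check could map onto the two channels is sketched below. The helper `channel_attr` is hypothetical, and raising `ValueError` for an unknown direction is an assumption: this ADR only states that the check lives in the handler's if/else branch, not what happens on an invalid value.

```python
def channel_attr(direction: str) -> str:
    """Hypothetical helper: name the resource attribute guarding `direction`."""
    if direction == "read":
        return "_read_res"
    elif direction == "write":
        return "_write_res"
    # Assumption: reject anything else instead of silently picking a channel.
    raise ValueError(f"unknown TcmRequest direction: {direction!r}")


print(channel_attr("read"))   # _read_res
print(channel_attr("write"))  # _write_res
```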

### D5. Legacy Transaction forwarding path is preserved

When `_worker` receives a non-`TcmRequest` message, it dispatches to
`_forward_txn`, applying `run()`'s `overhead_ns`. The current standard PE
pipeline does not route Transactions through TCM, but the path is kept to
avoid breakage if fabric topology changes.

This path is accounted for via standard Transaction op_log; the BW channel
locks are **not** acquired (orthogonal to D1's usage).

### D6. PE_TCM is not a data store (timing only)

TCM models **time only**. The actual data payload is held by sim_engine's
`memory_store` (when present); the TCM component never updates it.
PE_FETCH_STORE obtains BW delay through `TcmRequest`, and register contents
are handled separately in the data path (ADR-0020 2-pass data execution —
Phase 2).

## Alternatives Considered

### A1. Single channel (`capacity=2` for shared read+write)

Rejected. Would artificially serialize the normal-case overlap of fetch
(read) and store (write) and yield an incorrect BW upper bound for the PE
pipeline.

### A2. `capacity > 1` (e.g., 2-banked TCM)

Rejected. Current hardware model assumes a single bank. Multi-bank extension
needs its own ADR that would supersede D1. Bumping capacity now would loosen
the nominal serialization without raising the BW upper bound, producing less
accurate modeling.

### A3. Generalize BW formula to `nbytes / bw + overhead_ns`

Rejected. `overhead_ns` is reserved for the legacy forwarding path (D5).
Additional fetch/store-path overhead, if needed, belongs in PE_FETCH_STORE's
`run()` or in a register-file access model — closer to the responsibility
boundary.

## Consequences

- TCM's BW accounting is locked at ADR level. Questions arising from op_log
  in GEMM/Math sweeps — "why did fetch and store overlap?", "why do only
  same-direction requests serialize?" — resolve quickly to D1.
- Future multi-bank TCM models or asymmetric read/write BW changes have a
  clear blast radius (D1 / D2 / D3 — pick one).
- D6 ("TCM is not a data store") sharpens the responsibility boundary with
  ADR-0020 2-pass execution.