ADR: add 0038-0042 (pcie_ep, pe_mmu, pe_tcm, sram, tiling)

Fill component-model coverage gaps surfaced by /report's G4 analysis. Each ADR
documents the component's First action, latency model, and honest notes on
dormant code or implementation asymmetries discovered during re-evaluation
against current code.

- 0038 pcie_ep: thin protocol-overhead model; ComponentBase forwarding worker
  as-is; named-node contract for router helpers
- 0039 pe_mmu: component + utility dual role; sub-page region stopgap; D2.1
  flags pipeline path missing mmu.overhead_ns timeout (asymmetric with
  non-pipeline; not visible at default tlb_overhead_ns=0)
- 0040 pe_tcm: dual-channel BW serialization (read/write Resource cap=1);
  TcmRequest schema owned by TCM; timing-only (no data store)
- 0041 sram: terminal scratchpad model + ResponseMsg on reverse path; D1.1
  flags _worker override as currently dormant (no Transaction actually targets
  the SRAM node today)
- 0042 tiling: pure plan-generator module, not a component; corrects the G4
  misclassification; pins GEMM/Math stage sequences and epilogue scope contract

Also, /report skill G3 refinement: only flag older->newer asymmetric
cross-references; newer->older (e.g., 0034-0037 citing infrastructure ADRs)
are expected one-way and no longer reported.

Bilingual pair verifier (tools/verify_adr_lang_pairs.py) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,195 @@
# ADR-0041: Cube SRAM Component Model — terminal scratchpad on cube NoC

## Status

Accepted (2026-05-20).

ADR-0017 (Cube NOC and HBM Connectivity) describes SRAM as a cube-NoC
attachment but does not specify the SRAM component's own latency / response
model. This ADR fills that gap.

## First action

Inside `_worker`, immediately after pulling a Transaction off `_inbox`, the
very first action is `yield from self.run(env, txn.nbytes)`. Inside `run()`,
the component applies `env.timeout(node.attrs["overhead_ns"])`
(default `0.0`; the default topology sets `2.0 ns`, see D3).

In short, **SRAM's first act is "express access overhead as simulator time"**.
After overhead, the worker yields `drain_ns` (the terminal BW-serialization
cost stamped on the Transaction) and then constructs and dispatches a
`ResponseMsg` on the reverse path.

This differs from a generic `ComponentBase._worker`: SRAM knows it is a
**terminal node**, so it does not go through `_forward_txn`. Its own worker
explicitly performs `run → drain → _send_response`.

## Context

The cube topology (`topology/builder.py`) creates the following named nodes
per cube:

- `sip{S}.cube{C}.m_cpu`
- `sip{S}.cube{C}.sram`
- `sip{S}.cube{C}.hbm_ctrl` (per-PE partitions)
- `sip{S}.cube{C}.pe{P}` (and its PE-internal sub-components)

SRAM is one of the cube-NoC attachments: `topology/mesh_gen.py` assigns it
to the nearest router by placement coordinates and adds `"sram"` to that
router's `attach` list. The builder lays bidirectional `sram ↔ router` edges
(BW: `sram_to_router_bw_gbs`, default `128.0 GB/s`).

SRAM has two intertwined roles:

1. **Fabric terminal**: the endpoint for cube-NoC memory-access Transactions
   destined for SRAM. SRAM consumes access overhead + drain, then sends a
   response back on the reverse path.
2. **One of the IPCQ slot tiers**: ADR-0023 D9.7 defines
   `buffer_kind ∈ {tcm, sram, hbm}`; the `sram` tier's per-access cost is
   `(512.0 GB/s, 2.0 ns)` in `common/ipcq_types._BUFFER_KIND_BW`. This is
   separate from the SRAM node's `overhead_ns` attr; PE_DMA accounts for it
   directly at the IPCQ slot-write moment.

Without an ADR covering both roles, the following questions are ambiguous:

- "What latency does SRAM model?" Fabric drain + overhead, or the IPCQ
  tier slot latency? Answers are currently scattered.
- What will the `size_mb` (`32`) attr mean in the future? Currently it is not
  used; SRAM only models timing.
- Which cube router does SRAM attach to? (Placement-based; the answer lives
  in topology code only.)

## Decision
### D1. SRAM is a terminal scratchpad node on the cube NoC

`SramComponent` extends `ComponentBase` but overrides `_worker` to express
terminal semantics directly:

```
while True:
    txn = yield self._inbox.get()
    yield from self.run(env, txn.nbytes)      # applies overhead_ns
    if drain_ns > 0:                          # drain_ns: stamped on the Transaction
        yield env.timeout(drain_ns)
    yield from self._send_response(env, txn)  # ResponseMsg on the reverse path
```

This override is necessary because a terminal has no next hop to forward to:
SRAM must send the response back along the reverse path, which the generic
`_forward_txn` (which forwards to the next hop) does not do.
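
The ordering of the loop above can be traced without a simulator. What follows is a minimal sketch, not the project's implementation: `Txn`, its `drain_ns` field, `sram_worker_steps`, and the router name `sip0.cube0.r0` are hypothetical names introduced for illustration, and SimPy is deliberately left out so the snippet runs on the standard library alone. It also assumes that response dispatch adds no modeled delay of its own.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class Txn:
    """Hypothetical stand-in for a fabric Transaction."""
    nbytes: int
    drain_ns: float      # assumed field: wire-side cost stamped upstream
    path: List[str]


def sram_worker_steps(txn: Txn, overhead_ns: float) -> Iterator[Tuple[str, float]]:
    """Yield (stage, delay_ns) in the order a terminal worker spends time."""
    yield ("run", overhead_ns)        # access overhead always comes first
    if txn.drain_ns > 0:              # a zero drain is skipped, as in the loop
        yield ("drain", txn.drain_ns)
    yield ("send_response", 0.0)      # assumption: dispatch itself costs no time


txn = Txn(nbytes=4096, drain_ns=32.0,
          path=["sip0.cube0.pe0", "sip0.cube0.r0", "sip0.cube0.sram"])
steps = list(sram_worker_steps(txn, overhead_ns=2.0))
total_ns = sum(delay for _, delay in steps)
```

With these illustrative inputs the stages come out as `run`, `drain`, `send_response`, and the summed delay is overhead plus drain.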
#### D1.1. Currently dormant — the `_worker` override is an unused path

At the time of writing, **no component actually sends a Transaction to the
SRAM node**. The verified references to the SRAM node ID are:

- `policy/routing/router.py` and friends: guarantee path lookups.
- `components/builtin/pe_dma.py::_handle_ipcq_inbound`: for
  `buffer_kind == "sram"`, computes the drain over the *path* to
  `bank_node = f"{cube_prefix}.sram"` via `compute_drain_ns(path, ...)` and
  yields a **local** timeout. The Transaction itself does not flow to the
  SRAM node (see D4).
- `tests/test_routing.py`: checks connectivity via
  `find_path("sip0.cube0.pe0", "sip0.cube0.sram")`.

So the `_worker` / `_send_response` override is currently a **dormant code
path**. It is preserved deliberately:

- Topology changes that route fabric Transactions to SRAM terminally (e.g.,
  explicit M_CPU → SRAM accesses) would activate it immediately.
- ADR-0017's "cube-attached scratchpad" semantics naturally imply terminal
  behavior; the override is an intentional placeholder.

A future ADR (or a revision to this one) will mark dormancy resolved when an
actual sender is added.

### D2. ResponseMsg construction and reverse-path dispatch

`_send_response`:

1. `reverse_path = list(reversed(txn.path))`: derive the reverse path.
2. Construct `ResponseMsg(correlation_id=txn.request.correlation_id,
   request_id=..., src_cube=<this cube>, src_pe=-1, success=True)`.
3. Wrap it in `Transaction(request=resp_msg, path=reverse_path, step=0,
   nbytes=0, done=env.event(), is_response=True)` and put it on
   `out_ports[reverse_path[1]]`.
4. If the reverse path is too short (`< 2 hops`) or `ctx` is absent, fall
   back to calling `txn.done.succeed()` on the original Transaction.

`src_pe = -1` means "SRAM is not PE-localized". `src_cube` is parsed from the
node ID (`sip{S}.cube{C}.sram`).
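
The steps above can be sketched in plain Python. This is an illustration under stated assumptions rather than the component's code: `parse_src_cube`, `plan_response`, this trimmed `ResponseMsg` dataclass, and the router name `sip0.cube3.r1` are hypothetical. The `request_id` field is left out because the ADR elides its value, and the `ctx`-absent fallback is not modeled.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

_SRAM_ID = re.compile(r"^sip(\d+)\.cube(\d+)\.sram$")


def parse_src_cube(node_id: str) -> int:
    """Extract C from a node ID of the form sip{S}.cube{C}.sram."""
    match = _SRAM_ID.match(node_id)
    if match is None:
        raise ValueError(f"not an SRAM node id: {node_id!r}")
    return int(match.group(2))


@dataclass
class ResponseMsg:
    """Hypothetical mirror of a subset of the fields listed in step 2."""
    correlation_id: int
    src_cube: int
    src_pe: int = -1       # -1: SRAM is not PE-localized
    success: bool = True


def plan_response(node_id: str, path: List[str], correlation_id: int
                  ) -> Optional[Tuple[str, List[str], ResponseMsg]]:
    """Return (first reverse hop, reverse path, message), or None for the fallback."""
    reverse_path = list(reversed(path))
    if len(reverse_path) < 2:
        return None        # caller would complete the original txn.done instead
    msg = ResponseMsg(correlation_id=correlation_id,
                      src_cube=parse_src_cube(node_id))
    return reverse_path[1], reverse_path, msg


forward = ["sip0.cube3.pe0", "sip0.cube3.r1", "sip0.cube3.sram"]
hop, rpath, msg = plan_response("sip0.cube3.sram", forward, correlation_id=7)
```

Here `hop` is the node whose out-port would receive the response Transaction; returning `None` stands in for the short-path fallback in step 4.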
### D3. Timing parameters: `overhead_ns` and wire-side `drain_ns`

- **Component-side latency**: `node.attrs["overhead_ns"]`. The default
  topology uses `2.0 ns`.
- **Link-side serialization**: `drain_ns` arrives stamped on the Transaction
  as the wire-side BW serialization result from ADR-0015. SRAM only yields it.
- The `size_mb` (default `32 MiB`) attr is currently timing-neutral. If a
  capacity-aware model is added in the future, a separate ADR will give it
  meaning.
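
A worked example may help size the two terms. The sketch below assumes, without confirmation from ADR-0015, that the drain for a single link is simply `nbytes / bandwidth`, with decimal gigabytes so that 1 GB/s moves 1 byte per nanosecond. It uses the `sram ↔ router` link default of `128.0 GB/s` and the default-topology overhead of `2.0 ns`; the function names are hypothetical.

```python
def drain_ns_for_link(nbytes: int, bw_gbs: float) -> float:
    """Serialization time in ns, assuming 1 GB/s == 1 byte per ns."""
    if bw_gbs <= 0:
        raise ValueError("bandwidth must be positive")
    return nbytes / bw_gbs


def sram_access_ns(nbytes: int, overhead_ns: float = 2.0,
                   bw_gbs: float = 128.0) -> float:
    """Component-side overhead plus link-side drain for one access."""
    return overhead_ns + drain_ns_for_link(nbytes, bw_gbs)


# 4096 bytes over 128 GB/s drains in 32 ns; adding 2 ns overhead gives 34 ns.
example_ns = sram_access_ns(4096)
```

Note that `size_mb` appears nowhere in the calculation, which matches its timing-neutral status.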
### D4. IPCQ slot accounting is not modeled by the SRAM component

Per ADR-0023 D9.7, the IPCQ slot-write latency for the SRAM tier is incurred
inside PE_DMA's `_handle_ipcq_inbound`, which calls
`slot_io_latency_ns("sram", nbytes)` using `_BUFFER_KIND_BW["sram"]`. That is:

- When SRAM receives a fabric Transaction (D1, D2, D3 apply), it processes
  the Transaction normally.
- When an IPCQ slot lives on SRAM, PE_DMA pays the slot-write time directly,
  independent of the SRAM component.

This separation is intentional: IPCQ is a fast path (sub-cycle slot
bookkeeping) and does not traverse fabric Transactions, so SRAM does not need
to know about IPCQ.
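
To make the separation concrete, here is a hedged sketch of what the PE_DMA-side slot cost could look like. It is not the real `slot_io_latency_ns`: the `_sketch` names are hypothetical, the formula (fixed latency plus `nbytes / bandwidth`) is an assumption about how the `(512.0 GB/s, 2.0 ns)` pair is applied, and only the `sram` entry is filled in because the other tiers' values are not stated in this ADR.

```python
from typing import Dict, Tuple

# (bandwidth in GB/s, fixed latency in ns); only the sram tier is given here.
BUFFER_KIND_BW_SKETCH: Dict[str, Tuple[float, float]] = {
    "sram": (512.0, 2.0),
}


def slot_io_latency_ns_sketch(buffer_kind: str, nbytes: int) -> float:
    """Assumed model: fixed cost plus bytes over bandwidth (1 GB/s == 1 B/ns)."""
    bw_gbs, fixed_ns = BUFFER_KIND_BW_SKETCH[buffer_kind]
    return fixed_ns + nbytes / bw_gbs


# A 1024-byte slot write: 2.0 ns fixed + 1024 / 512 = 2.0 ns transfer.
slot_ns = slot_io_latency_ns_sketch("sram", 1024)
```

Nothing in this calculation reads the SRAM node's `overhead_ns` or touches its inbox, which is the point of D4.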
### D5. SRAM's cube-NoC attachment is placement-driven

`topology/mesh_gen.py` reads `placement.sram.pos_mm` (default `[1.5, 9.0]` in
`topology.yaml`) and adds `"sram"` to the nearest router's `attach`. The
builder (`topology/builder.py`'s attachment loop) then lays bidirectional
edges between the `sram` node and that router.

This decision lives outside the SRAM component (mesh_gen / builder); the
component does not know which router it sits on. It only relies on
`txn.path` / `reverse_path` to reach it via a router.
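
The nearest-router assignment can be illustrated as follows. This is a sketch only: the router names and coordinates are invented, `nearest_router` is a hypothetical helper, and Euclidean distance with a name-ordered tie-break is an assumption, since the ADR does not state which metric `mesh_gen` uses.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def nearest_router(pos_mm: Point, routers: Dict[str, Point]) -> str:
    """Pick the router closest to pos_mm; ties resolve to the smallest name."""
    if not routers:
        raise ValueError("no routers to attach to")
    return min(sorted(routers), key=lambda name: math.dist(pos_mm, routers[name]))


# Hypothetical router placements in millimetres.
routers: Dict[str, Point] = {"r0": (0.0, 8.0), "r1": (2.0, 9.0), "r2": (4.0, 9.0)}
attach: Dict[str, List[str]] = {name: [] for name in routers}

chosen = nearest_router((1.5, 9.0), routers)   # default placement.sram.pos_mm
attach[chosen].append("sram")
```

Moving the SRAM coordinate changes only which `attach` list receives `"sram"`; the component itself is unaffected.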
### D6. SRAM is not a data store (timing only)

Same context as ADR-0040 D6: the SRAM component models time only; the data
payload (if any) lives in sim_engine's `memory_store`.

## Alternatives Considered
### A1. Use `_forward_txn` and route responses via separate nodes (à la IO_CPU / HBM_CTRL)

Rejected. SRAM is a terminal on the cube NoC; adding a response node would
introduce meaningless hops and violate ADR-0017's simplification spirit.

### A2. Model BW serialization inside SRAM with its own resource

Rejected. Wire-side BW serialization (`drain_ns`) already captures it. An
internal `simpy.Resource` would double-count against ADR-0015 (port/wire
model).

### A3. Handle IPCQ slot accounting in the SRAM component

Rejected. As D4 makes explicit, IPCQ is a fast path that does not traverse
fabric Transactions. If SRAM knew about IPCQ, the responsibility would split
across two places and obscure reasoning.

### A4. Capacity-aware latency from `size_mb`

Rejected for now. The capacity is currently a visualizer label; introducing
a capacity-aware timing model requires a dedicated ADR.

## Consequences

- SRAM's timing model is pinned at ADR level as
  `overhead_ns + drain_ns + ResponseMsg(reverse_path)`. Any proposal to push
  IPCQ slot latency into the SRAM component can be refused with D4.
- D3 records that `size_mb` is timing-neutral today, so a future
  capacity-aware model has a narrow compatibility scope.
- D5 documents the placement-driven attachment, so changes to the SRAM
  coordinate have a clearly bounded impact (`mesh_gen` only).