kernbench2/docs/adr/ADR-0041-dev-cube-sram-component-model.md
ywkang 1f36baa898 ADR: add 0038-0042 (pcie_ep, pe_mmu, pe_tcm, sram, tiling)
Fill component-model coverage gaps surfaced by /report's G4 analysis.
Each ADR documents the component's First action, latency model, and
honest notes on dormant code or implementation asymmetries discovered
during re-evaluation against current code.

- 0038 pcie_ep: thin protocol-overhead model; ComponentBase forwarding
  worker as-is; named-node contract for router helpers
- 0039 pe_mmu: component + utility dual role; sub-page region stopgap;
  D2.1 flags pipeline path missing mmu.overhead_ns timeout (asymmetric
  with non-pipeline; not visible at default tlb_overhead_ns=0)
- 0040 pe_tcm: dual-channel BW serialization (read/write Resource cap=1);
  TcmRequest schema owned by TCM; timing-only (no data store)
- 0041 sram: terminal scratchpad model + ResponseMsg on reverse path;
  D1.1 flags _worker override as currently dormant (no Transaction
  actually targets the SRAM node today)
- 0042 tiling: pure plan-generator module, not a component; corrects
  the G4 misclassification; pins GEMM/Math stage sequences and
  epilogue scope contract

Also: /report skill G3 refinement — only flag older->newer asymmetric
cross-references; newer->older (e.g., 0034-0037 citing infrastructure
ADRs) are expected one-way and no longer reported.

Bilingual pair verifier (tools/verify_adr_lang_pairs.py) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:43:03 -07:00

ADR-0041: Cube SRAM Component Model — terminal scratchpad on cube NoC

Status

Accepted (2026-05-20).

ADR-0017 (Cube NOC and HBM Connectivity) describes SRAM as a cube-NoC attachment but does not specify the SRAM component's own latency / response model. This ADR fills that gap.

First action

Inside _worker, immediately after pulling a Transaction off _inbox, the very first action is yield from self.run(env, txn.nbytes). Inside run(), the component yields env.timeout(node.attrs["overhead_ns"]); the attribute falls back to 0.0 when unset, while the default topology sets it to 2.0 ns (see D3).

In short, SRAM's first act is "express access overhead as simulator time". After overhead, the worker yields drain_ns (the terminal BW-serialization cost stamped on the Transaction) and then constructs and dispatches a ResponseMsg on the reverse path.

This differs from a generic ComponentBase._worker: SRAM knows it is a terminal node, so it does not go through _forward_txn. Its own worker explicitly performs run → drain → _send_response.

Context

The cube topology (topology/builder.py) creates the following named nodes per cube:

  • sip{S}.cube{C}.m_cpu
  • sip{S}.cube{C}.sram
  • sip{S}.cube{C}.hbm_ctrl (per-PE partitions)
  • sip{S}.cube{C}.pe{P} (and its PE-internal sub-components)

SRAM is one of the cube-NoC attachments — topology/mesh_gen.py assigns it to the nearest router by placement coordinates and adds "sram" to that router's attach list. The builder lays bidirectional sram ↔ router edges (BW: sram_to_router_bw_gbs, default 128.0 GB/s).

SRAM has two intertwined roles:

  1. Fabric terminal: the endpoint for cube-NoC memory-access Transactions destined for SRAM. SRAM consumes access overhead + drain, then sends a response back on the reverse path.
  2. One of the IPCQ slot tiers: ADR-0023 D9.7 defines buffer_kind ∈ {tcm, sram, hbm}; the sram tier's per-access cost is (512.0 GB/s, 2.0 ns) in common/ipcq_types._BUFFER_KIND_BW. This is separate from the SRAM node's overhead_ns attr; PE_DMA accounts for it directly at the IPCQ slot-write moment.

Without an ADR covering both roles, the following questions are ambiguous:

  • What latency does SRAM model: fabric drain + overhead, or the IPCQ tier slot latency? Today the answer is scattered across several places.
  • What will the size_mb (32) attr mean in the future? It is currently unused; SRAM models timing only.
  • Which cube router does SRAM attach to? (placement-based; lives in topology code only.)

Decision

D1. SRAM is a terminal scratchpad node on the cube NoC

SramComponent extends ComponentBase but overrides _worker to express terminal semantics directly:

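while True:
    txn = yield self._inbox.get()
    # Access overhead: run() yields env.timeout(overhead_ns).
    yield from self.run(env, txn.nbytes)
    # Wire-side drain stamped on the Transaction (attribute name assumed here).
    drain_ns = txn.drain_ns
    if drain_ns > 0:
        yield env.timeout(drain_ns)
    # Terminal node: answer on the reverse path instead of forwarding.
    yield from self._send_response(env, txn)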

This pattern is necessary because SRAM must know the reverse path; the generic _forward_txn (which forwards to the next hop) does not fit a terminal.
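
The run → drain → _send_response ordering can be traced without SimPy. What follows is a minimal sketch, not the simulator's code: Txn, its drain_ns field, and sram_worker_steps are hypothetical names introduced only for illustration, and the delays are listed in order rather than scheduled on an event loop.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    """Hypothetical stand-in for a fabric Transaction (illustration only)."""
    nbytes: int
    drain_ns: float  # assumed field name for the wire-side drain stamp


def sram_worker_steps(txn, overhead_ns=0.0):
    """Yield (stage, delay_ns) pairs in the order a terminal worker spends them."""
    # Stage 1: access overhead (node.attrs["overhead_ns"] in the ADR).
    yield ("overhead", overhead_ns)
    # Stage 2: the drain stage is skipped when no positive drain was stamped.
    if txn.drain_ns > 0:
        yield ("drain", txn.drain_ns)
    # Stage 3: response dispatch adds no delay of its own in this sketch.
    yield ("send_response", 0.0)


def terminal_latency_ns(txn, overhead_ns=0.0):
    """Total time spent at the terminal before the response leaves."""
    return sum(delay for _, delay in sram_worker_steps(txn, overhead_ns))


stages = list(sram_worker_steps(Txn(nbytes=256, drain_ns=3.0), overhead_ns=2.0))
```

With the values above the three stages appear in order and the terminal latency is the plain sum of overhead and drain; a Transaction with drain_ns of 0 skips the drain stage.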

D1.1. Currently dormant — the _worker override is an unused path

At the time of writing, no component actually sends a Transaction to the SRAM node. The verified references to the SRAM node ID are:

  • policy/routing/router.py and friends — guarantee path lookups.
  • components/builtin/pe_dma.py::_handle_ipcq_inbound: for buffer_kind == "sram", it resolves the path to bank_node = f"{cube_prefix}.sram", passes that path to compute_drain_ns(path, ...), and yields the result as a local timeout. The Transaction itself does not flow to the SRAM node (see D4).
  • tests/test_routing.py — checks connectivity via find_path("sip0.cube0.pe0", "sip0.cube0.sram").

So the _worker / _send_response override is currently a dormant code path. It is preserved deliberately:

  • Topology changes that route fabric Transactions to SRAM terminally (e.g., explicit M_CPU → SRAM accesses) would activate it immediately.
  • ADR-0017's "cube-attached scratchpad" semantics naturally implies terminal behavior; the override is an intentional placeholder.

A future ADR (or a revision to this one) will mark dormancy resolved when an actual sender is added.

D2. ResponseMsg construction and reverse-path dispatch

_send_response:

  1. reverse_path = list(reversed(txn.path)) — derive the reverse path.
  2. Construct ResponseMsg(correlation_id=txn.request.correlation_id, request_id=..., src_cube=<this cube>, src_pe=-1, success=True).
  3. Wrap in Transaction(request=resp_msg, path=reverse_path, step=0, nbytes=0, done=env.event(), is_response=True) and put on out_ports[reverse_path[1]].
  4. If the reverse path is too short (fewer than 2 entries, so reverse_path[1] does not exist) or ctx is absent, fall back to calling the original txn.done.succeed().

src_pe = -1 means "SRAM is not PE-localized". src_cube is parsed from the node ID (sip{S}.cube{C}.sram).
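
The reverse-path derivation, the src_cube parse, and the short-path fallback can be sketched with the standard library alone. ResponseSketch and plan_response are hypothetical names, the router node ID in the usage note is made up, and the sketch assumes the forward path ends at the SRAM node; the real ResponseMsg and Transaction types carry more fields than shown here.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class ResponseSketch:
    """Hypothetical stand-in for ResponseMsg; only fields named in D2."""
    correlation_id: int
    src_cube: int
    src_pe: int = -1       # -1: SRAM is not PE-localized
    success: bool = True


def parse_src_cube(node_id: str) -> int:
    """Extract C from a node ID of the form sip{S}.cube{C}.sram."""
    match = re.fullmatch(r"sip(\d+)\.cube(\d+)\.sram", node_id)
    if match is None:
        raise ValueError(f"not an SRAM node ID: {node_id!r}")
    return int(match.group(2))


def plan_response(path: List[str], correlation_id: int):
    """Return (first_hop, reverse_path, response), or None for the fallback case."""
    reverse_path = list(reversed(path))
    if len(reverse_path) < 2:
        # No next hop exists; the caller would complete txn.done directly.
        return None
    # Assumption: the forward path ends at the SRAM node, so it is reverse_path[0].
    response = ResponseSketch(correlation_id, parse_src_cube(reverse_path[0]))
    return reverse_path[1], reverse_path, response
```

For a forward path such as ["sip0.cube2.pe0", "sip0.cube2.r5", "sip0.cube2.sram"], the first hop of the response is the router entry and src_cube resolves to 2; a single-entry path returns None, which corresponds to the fallback in step 4.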

D3. Timing parameters: overhead_ns and wire-side drain_ns

  • Component-side latency: node.attrs["overhead_ns"]. Default topology uses 2.0 ns.
  • Link-side serialization: drain_ns arrives stamped on the Transaction — the wire-side BW serialization result from ADR-0015. SRAM only yields it.
  • The size_mb (default 32 MiB) attr is currently timing-neutral. If a capacity-aware model is added in the future, a separate ADR will give it meaning.

D4. IPCQ slot accounting is not modeled by the SRAM component

Per ADR-0023 D9.7, the IPCQ slot-write latency for the SRAM tier is incurred inside PE_DMA's _handle_ipcq_inbound, which calls slot_io_latency_ns("sram", nbytes) using _BUFFER_KIND_BW["sram"]. That is:

  • When SRAM receives a fabric Transaction (D1, D2, D3 apply), it processes normally.
  • When an IPCQ slot lives on SRAM, PE_DMA pays the slot-write time directly — independent of the SRAM component.

This separation is intentional: IPCQ is a fast path (sub-cycle slot bookkeeping) that never issues fabric Transactions, so SRAM does not need to know about IPCQ.
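
To make the two cost paths concrete, here is a small sketch under stated assumptions. It reads the (512.0, 2.0) tuple as (bandwidth in GB/s, fixed overhead in ns) and assumes the slot cost is overhead plus bytes over bandwidth; the actual slot_io_latency_ns formula is not given in this ADR, so both function names below are hypothetical.

```python
# Assumed reading of _BUFFER_KIND_BW["sram"]: (bandwidth in GB/s, fixed overhead in ns).
SRAM_TIER = (512.0, 2.0)


def slot_write_ns_sketch(nbytes: int, tier=SRAM_TIER) -> float:
    """IPCQ path (hypothetical formula): fixed overhead plus bytes over bandwidth."""
    bw_gbs, overhead_ns = tier
    # 1 GB/s equals 1 byte/ns, so bytes / (GB/s) is already in nanoseconds.
    return overhead_ns + nbytes / bw_gbs


def fabric_terminal_ns_sketch(overhead_ns: float, drain_ns: float) -> float:
    """Fabric path (D1 to D3): node overhead plus the drain stamped on the Transaction."""
    return overhead_ns + max(drain_ns, 0.0)


ipcq_cost = slot_write_ns_sketch(1024)            # paid inside PE_DMA
fabric_cost = fabric_terminal_ns_sketch(2.0, 8.0)  # paid inside the SRAM worker
```

The point of the sketch is that the two functions share no inputs: the IPCQ cost depends only on the tier table and the payload size, while the fabric cost depends only on the node attr and the stamped drain.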

D5. SRAM's cube-NoC attachment is placement-driven

topology/mesh_gen.py reads placement.sram.pos_mm (default [1.5, 9.0] in topology.yaml) and adds "sram" to the nearest router's attach. The builder (topology/builder.py's attachment loop) then lays bidirectional edges between the sram node and that router.

This decision lives outside the SRAM component (mesh_gen / builder); the component does not know which router it sits on. It only relies on txn.path / reverse_path to reach it via a router.
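
A nearest-router assignment of this kind might look like the sketch below. The router names and coordinates are invented for illustration, and Euclidean distance with a name-ordered tie-break is an assumption; the ADR only states that mesh_gen picks the nearest router from the placement coordinates and that the builder lays bidirectional edges.

```python
import math


def nearest_router(pos_mm, routers):
    """Return the router name closest to pos_mm (Euclidean distance, assumed metric)."""
    # Iterating in sorted order makes ties resolve to the smallest name.
    return min(sorted(routers), key=lambda name: math.dist(pos_mm, routers[name]))


# Hypothetical router coordinates in mm; only the SRAM default comes from the ADR.
routers = {"r0": (0.0, 0.0), "r1": (2.0, 8.0), "r2": (6.0, 9.0)}
attach = {name: [] for name in routers}

chosen = nearest_router((1.5, 9.0), routers)
attach[chosen].append("sram")

# Bidirectional sram <-> router edges, using the default sram_to_router_bw_gbs.
bw_gbs = 128.0
edges = [("sram", chosen, bw_gbs), (chosen, "sram", bw_gbs)]
```

Moving pos_mm changes only which router receives "sram" in its attach list; the SRAM component itself is unaffected, which matches the separation described above.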

D6. SRAM is not a data store (timing only)

Same context as ADR-0040 D6: the SRAM component models time only; the data payload (if any) lives in sim_engine's memory_store.

Alternatives Considered

A1. Use _forward_txn and route responses via separate nodes (à la IO_CPU / HBM_CTRL)

Rejected. SRAM is a terminal on the cube NoC; a separate response node would add hops that model nothing and would run against the simplification intent of ADR-0017.

A2. Model BW serialization inside SRAM with its own resource

Rejected. Wire-side BW serialization (drain_ns) already captures it. An internal simpy.Resource would double-count against ADR-0015 (port/wire model).

A3. Handle IPCQ slot accounting in the SRAM component

Rejected. As D4 makes explicit, IPCQ is a fast path that never issues fabric Transactions. If SRAM also knew about IPCQ, responsibility for slot accounting would be split between PE_DMA and the SRAM component, making the timing harder to reason about.

A4. Capacity-aware latency from size_mb

Rejected for now. The capacity is currently a visualizer label; introducing a capacity-aware timing model requires a dedicated ADR.

Consequences

  • SRAM's timing model is pinned at ADR level as overhead_ns + drain_ns + ResponseMsg(reverse_path). Any proposal to push IPCQ slot latency into the SRAM component can be declined by citing D4.
  • D3 records that size_mb is timing-neutral today, so a future capacity-aware model has no existing capacity-dependent timing to stay compatible with.
  • D5 documents the placement-driven attachment, so changes to the SRAM coordinate have a clearly bounded impact (mesh_gen only).