kernbench2/docs/adr/ADR-0038-dev-pcie-ep-component-model.md
ywkang 1f36baa898 ADR: add 0038-0042 (pcie_ep, pe_mmu, pe_tcm, sram, tiling)
Fill component-model coverage gaps surfaced by /report's G4 analysis.
Each ADR documents the component's First action, latency model, and
honest notes on dormant code or implementation asymmetries discovered
during re-evaluation against current code.

- 0038 pcie_ep: thin protocol-overhead model; ComponentBase forwarding
  worker as-is; named-node contract for router helpers
- 0039 pe_mmu: component + utility dual role; sub-page region stopgap;
  D2.1 flags pipeline path missing mmu.overhead_ns timeout (asymmetric
  with non-pipeline; not visible at default tlb_overhead_ns=0)
- 0040 pe_tcm: dual-channel BW serialization (read/write Resource cap=1);
  TcmRequest schema owned by TCM; timing-only (no data store)
- 0041 sram: terminal scratchpad model + ResponseMsg on reverse path;
  D1.1 flags _worker override as currently dormant (no Transaction
  actually targets the SRAM node today)
- 0042 tiling: pure plan-generator module, not a component; corrects
  the G4 misclassification; pins GEMM/Math stage sequences and
  epilogue scope contract

Also: /report skill G3 refinement — only flag older->newer asymmetric
cross-references; newer->older (e.g., 0034-0037 citing infrastructure
ADRs) are expected one-way and no longer reported.

Bilingual pair verifier (tools/verify_adr_lang_pairs.py) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:43:03 -07:00


ADR-0038: PCIE_EP Component Model

Status

Accepted (2026-05-20).

Companion to ADR-0035 (M_CPU), ADR-0036 (IO_CPU), and ADR-0037 (Forwarding) at the same component-model level.

First action

Pull one Transaction from _inbox and let _forward_txn invoke run(), which applies a single env.timeout(node.attrs["overhead_ns"]) for PCIe protocol handling. After that the standard ComponentBase worker rules take over: if next_hop exists, put the advanced Transaction on out_ports[next_hop]; otherwise consume drain_ns and call txn.done.succeed().

In other words, PCIE_EP's first (and only) act is to spend the configured overhead as simulator time — no routing decisions, no payload transformation, no MMIO decoding.

Context

PCIE_EP is the host ↔ device boundary in the topology graph. The builder (topology/builder.py) creates an IO chiplet instance per SIP that contains pcie_ep, io_cpu, and io_noc, and lays bidirectional edges between the external fabric.switch0 and each pcie_ep:

  • switch → pcie_ep: host → device traffic (MemoryWrite, MemoryRead, KernelLaunch).
  • pcie_ep → switch: device-side outbound (e.g., cross-SIP IPCQ tokens).

Inside the IO chiplet there are bidirectional pcie_ep ↔ io_noc edges, and from there traffic branches to io_cpu or to the cube-side hbm_ctrl path (see ADR-0036 IO_CPU model). The router and resolver already know — per SPEC R7 — that PCIE_EP is the endpoint for memory operations, so helpers like find_pcie_ep(sip) and find_memory_path(pcie_ep, dst_node) treat PCIE_EP as the start (or end) of the memory path.

The problem is that all of these dependencies live in the builder, router, and resolver, while PCIE_EP's own internal model has no ADR. The consequences:

  • "What latency does PCIE_EP model?" requires reading the source.
  • The asymmetry with peer components (IO_CPU = ADR-0036, M_CPU = ADR-0035) is awkward.
  • Future decisions about a more detailed PCIe link-layer model (TLP credits, retry, MPS chunking) lack a documented baseline.

This ADR pins down the current thin PCIE_EP model and records that this thinness is intentional (aligned with ADR-0033's latency-model simplification policy).

Decision

D1. PCIE_EP uses ComponentBase's generic forwarding worker as-is

PcieEpComponent extends ComponentBase and does not override _worker or _forward_txn. Every Transaction flows through the standard sequence:

  1. _fan_in accumulates inbound messages (and reassembles Flits, per ADR-0033 Phase 2c) into _inbox.
  2. _worker pulls one message off _inbox and spawns env.process(self._forward_txn(env, txn)) for per-message pipelining.
  3. _forward_txn calls the op_log start hook → run() for latency → op_log end hook.
  4. run() is a single line: yield env.timeout(overhead_ns).
  5. If a next hop exists, out_ports[next_hop].put(txn.advance()). Otherwise (terminal arrival) consume drain_ns and call txn.done.succeed().

D2. The only timing parameter is overhead_ns

Only node.attrs["overhead_ns"] is accepted as a latency parameter. The code default is 0.0; topology.yaml's IOChiplet components.pcie_ep.attrs supplies the real value (current topology: overhead_ns: 5.0 ns).

No separate BW-serialization resource (simpy.Resource), queue depth, or retry model is introduced. Link-level BW serialization is handled wire-side: inside the IOChiplet by pcie_ep_to_noc_bw_gbs = 256.0 GB/s, and externally by the system's io_ep_to_switch link BW (ADR-0015 port/wire model). PCIE_EP itself takes no part in that accounting.
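To make the split between component overhead and wire-side serialization concrete, here is a small arithmetic sketch. Both function names are hypothetical, and the serialization formula (payload bytes divided by bandwidth, with 1 GB/s taken as 1 byte/ns) is an assumption for illustration; the actual wire accounting is defined by ADR-0015, not here.

```python
def pcie_ep_overhead_ns(attrs):
    """Latency charged by PCIE_EP itself: attrs["overhead_ns"], default 0.0."""
    return float(attrs.get("overhead_ns", 0.0))


def wire_serialization_ns(payload_bytes, bw_gbs):
    """Assumed wire-side cost. 1 GB/s (decimal) equals 1 byte per ns."""
    return payload_bytes / bw_gbs


# Values taken from the ADR text: overhead_ns = 5.0, 256.0 GB/s on the
# pcie_ep -> io_noc wire. The 4096-byte payload is an arbitrary example.
ep_ns = pcie_ep_overhead_ns({"overhead_ns": 5.0})
wire_ns = wire_serialization_ns(4096, 256.0)

# With no attrs supplied, the code default applies and PCIE_EP adds nothing.
default_ns = pcie_ep_overhead_ns({})
```

Under these assumptions the component contributes a fixed 5.0 ns regardless of payload size, while the 16.0 ns for the 4096-byte payload belongs entirely to the wire.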

D3. PCIE_EP is direction-aware in topology but direction-blind in code

The builder lays both switch ↔ pcie_ep and pcie_ep ↔ io_noc edges, so PCIE_EP serves:

  • inbound (host → device): forward Transactions arriving from the switch onto io_noc-side next-hop.
  • outbound (device → host): forward Transactions arriving from io_noc/io_cpu back to the switch.

Both are handled by D1's generic forwarding worker; the component code never distinguishes direction (it just follows txn.next_hop).

D4. PCIE_EP is not Flit-aware (legacy reassembly path)

_FLIT_AWARE is left at the inherited False, so _fan_in reassembles upstream-chunkified Flits into the parent Transaction before delivery to _inbox (aligned with ADR-0033 Phase 2c incremental rollout).

A future PCIe TLP-level credit model would revisit D4.
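As a rough picture of what reassembly means for a non-Flit-aware component, the sketch below holds back chunks until every Flit of a parent has arrived and only then delivers the parent. The `Flit` fields and the `Reassembler` class are hypothetical; the real `_fan_in` logic lives in ComponentBase and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flit:
    """Hypothetical chunk of a parent Transaction."""
    parent_id: int
    index: int   # position of this chunk within the parent
    total: int   # number of chunks the parent was split into


class Reassembler:
    """Delivers a parent id to `inbox` only once all its Flits are seen."""

    def __init__(self):
        self._seen = {}   # parent_id -> set of chunk indices received
        self.inbox = []   # parent ids handed to the worker, in order

    def on_arrival(self, flit):
        seen = self._seen.setdefault(flit.parent_id, set())
        seen.add(flit.index)
        if len(seen) == flit.total:
            # Last missing chunk: drop bookkeeping and deliver the parent.
            del self._seen[flit.parent_id]
            self.inbox.append(flit.parent_id)
            return True
        return False
```

The worker therefore sees whole Transactions only, which is why `run()` can charge `overhead_ns` once per Transaction rather than once per Flit.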

D5. PCIE_EP is a named node for routing helpers

policy/routing/router.py provides find_pcie_ep(sip, io_id="io0"), find_all_pcie_eps(), and find_memory_path(pcie_ep, dst_node) — all of which treat PCIE_EP as the start (or end) of the memory path. The component itself supplies no information to these helpers; the naming convention (sip{S}.{io_id}.pcie_ep) is guaranteed by the topology builder.
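The naming contract can be illustrated with a pair of string helpers. These are hypothetical stand-ins written for this example, not the signatures in policy/routing/router.py; the only thing taken from the ADR is the `sip{S}.{io_id}.pcie_ep` pattern and the `io0` default.

```python
import re

# Matches node ids of the form sip{S}.{io_id}.pcie_ep. The exact character
# set allowed in S and io_id is an assumption.
_PCIE_EP_RE = re.compile(r"^sip[^.]+\.[^.]+\.pcie_ep$")


def pcie_ep_node_id(sip, io_id="io0"):
    """Build the node id the topology builder is expected to produce."""
    return f"sip{sip}.{io_id}.pcie_ep"


def all_pcie_ep_ids(node_ids):
    """Select PCIE_EP nodes from a flat collection of node ids."""
    return sorted(n for n in node_ids if _PCIE_EP_RE.match(n))


nodes = ["sip0.io0.pcie_ep", "sip0.io0.io_cpu", "sip1.io0.pcie_ep", "fabric.switch0"]
eps = all_pcie_ep_ids(nodes)
```

Because lookup depends only on the id string, renaming the component or changing the id layout in the builder would break helpers of this kind without any change to the component code.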

Alternatives Considered

A1. Full PCIe TLP-level model (credits, retry, MPS chunking)

Rejected. Violates ADR-0033's "current latency model = abstract overhead + BW serialization" simplification. Host↔device protocol fidelity is explicitly out-of-scope in SPEC §5 "Non-Goals".

A2. Per-PCIE_EP simpy.Resource for in-flight cap

Rejected. Host traffic is not a contention bottleneck in current workloads. Defer to a separate ADR if it becomes one (in which case D1 stays and D2 is extended).

A3. Merge PCIE_EP into IO_CPU

Rejected. PCIE_EP is the protocol-boundary node first hit on the host side; IO_CPU is the device-side control-plane processing node (ADR-0036). Traffic fan-out and command decoding costs concentrate in IO_CPU, while PCIE_EP only expresses link-edge overhead. Merging them would mix two responsibilities and violate the spirit of ADR-0007 (runtime API/sim_engine boundaries).

Consequences

  • PCIE_EP gets an explicit model ADR despite having near-zero code. This keeps it consistent with peer component ADRs and lowers maintenance friction.
  • A future PCIe-level refinement would supersede this ADR by extending D2/D4 in a new ADR.
  • D5 makes the named-node dependency explicit, so any future renaming of component IDs has a clearly bounded blast radius.