# ADR-0038: PCIE_EP Component Model

## Status

Accepted (2026-05-20).

Companion to ADR-0035 (M_CPU), ADR-0036 (IO_CPU), and
ADR-0037 (Forwarding) at the same component-model level.

## First action

Pull one Transaction from `_inbox` and let `_forward_txn` invoke `run()`, which
applies a single `env.timeout(node.attrs["overhead_ns"])` for PCIe protocol
handling. After that the standard `ComponentBase` worker rules take over: if
`next_hop` exists, put the advanced Transaction on `out_ports[next_hop]`;
otherwise consume `drain_ns` and call `txn.done.succeed()`.

In other words, **PCIE_EP's first (and only) act is to spend the configured
overhead as simulator time**: no routing decisions, no payload transformation,
no MMIO decoding.

## Context

PCIE_EP is the **host ↔ device boundary** in the topology graph. The builder
(`topology/builder.py`) creates an IO chiplet instance per SIP that contains
`pcie_ep`, `io_cpu`, and `io_noc`, and lays bidirectional edges between the
external `fabric.switch0` and each `pcie_ep`:

- `switch → pcie_ep`: host → device traffic (MemoryWrite, MemoryRead,
  KernelLaunch).
- `pcie_ep → switch`: device-side outbound (e.g., cross-SIP IPCQ tokens).

Inside the IO chiplet there are bidirectional `pcie_ep ↔ io_noc` edges, and
from there traffic branches to `io_cpu` or to the cube-side `hbm_ctrl` path
(see ADR-0036 IO_CPU model). The router and resolver already know (per SPEC
R7) that PCIE_EP is the endpoint for memory operations, so helpers like
`find_pcie_ep(sip)` and `find_memory_path(pcie_ep, dst_node)` treat PCIE_EP as
the start (or end) of the memory path.

The problem is that all of this dependency lives in builder/router/resolver,
while **PCIE_EP's own internal model has no ADR**. The consequences:

- Answering "What latency does PCIE_EP model?" requires reading the source.
- Peer components have model ADRs (IO_CPU = ADR-0036, M_CPU = ADR-0035) while
  PCIE_EP does not, which is an awkward asymmetry.
- Future decisions about a more detailed PCIe link-layer model (TLP credits,
  retry, MPS chunking) lack a documented baseline.

This ADR pins down the current **thin PCIE_EP model** and records that this
thinness is intentional (aligned with ADR-0033's latency-model simplification
policy).

## Decision

### D1. PCIE_EP uses ComponentBase's generic forwarding worker as-is

`PcieEpComponent` extends `ComponentBase` and does **not** override `_worker` or
`_forward_txn`. Every Transaction flows through the standard sequence:

1. `_fan_in` accumulates inbound messages (and reassembles Flits, per ADR-0033
   Phase 2c) into `_inbox`.
2. `_worker` pulls one message off `_inbox` and spawns
   `env.process(self._forward_txn(env, txn))` for per-message pipelining.
3. `_forward_txn` calls the op_log start hook, then `run()` for latency, then
   the op_log end hook.
4. `run()` is a single line: `yield env.timeout(overhead_ns)`.
5. If a next hop exists, `out_ports[next_hop].put(txn.advance())`. Otherwise
   (terminal arrival) consume `drain_ns` and call `txn.done.succeed()`.

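
Steps 4 and 5 can be illustrated with a minimal sketch. It does not use SimPy or the real `ComponentBase`; `ToyEnv`, `ToyTxn`, `forward_txn`, the `drain_ns` value of 2.0, and the `sip0.io0.io_noc` node name are all hypothetical stand-ins chosen for illustration, and only the order of operations is taken from the list above.

```python
from dataclasses import dataclass


class ToyEnv:
    """Hypothetical stand-in for a SimPy environment: it only tracks time."""

    def __init__(self):
        self.now = 0.0

    def timeout(self, delay_ns):
        # A real SimPy env returns an event; here the delay is the "event".
        return delay_ns

    def drive(self, process):
        # Run a generator to completion, advancing the clock per yielded delay.
        for delay_ns in process:
            self.now += delay_ns


@dataclass
class ToyTxn:
    """Hypothetical Transaction: a precomputed route plus a completion flag."""
    route: list
    hop: int = 0
    done: bool = False

    @property
    def next_hop(self):
        nxt = self.hop + 1
        return self.route[nxt] if nxt < len(self.route) else None

    def advance(self):
        return ToyTxn(self.route, self.hop + 1, self.done)


def forward_txn(env, txn, overhead_ns, drain_ns, out_ports):
    # Step 4: run() is a single timeout for PCIe protocol handling.
    yield env.timeout(overhead_ns)
    # Step 5: forward if a next hop exists, otherwise drain and complete.
    if txn.next_hop is not None:
        out_ports[txn.next_hop].append(txn.advance())
    else:
        yield env.timeout(drain_ns)
        txn.done = True


env = ToyEnv()
io_noc_port = []
out_ports = {"sip0.io0.io_noc": io_noc_port}

# Inbound: the Transaction currently sits at pcie_ep and has one hop left.
inbound = ToyTxn(["fabric.switch0", "sip0.io0.pcie_ep", "sip0.io0.io_noc"], hop=1)
env.drive(forward_txn(env, inbound, 5.0, 2.0, out_ports))
forwarded_at = env.now

# Terminal arrival: pcie_ep is the last node on the route.
terminal = ToyTxn(["sip0.io0.io_noc", "sip0.io0.pcie_ep"], hop=1)
env.drive(forward_txn(env, terminal, 5.0, 2.0, out_ports))
```

In this toy run the forwarded Transaction leaves after `overhead_ns` alone, while the terminal one additionally pays `drain_ns` before it is marked done. The same function serves both cases because the only branch is on `next_hop`.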
### D2. The only timing parameter is `overhead_ns`

Only `node.attrs["overhead_ns"]` is accepted as a latency parameter. The code
default is `0.0`; `topology.yaml`'s IOChiplet `components.pcie_ep.attrs`
supplies the real value (current topology: `overhead_ns: 5.0`, in nanoseconds).

No separate BW-serialization resource (`simpy.Resource`), queue depth, or
retry model is introduced. Link-level BW serialization is handled wire-side:
inside the IOChiplet by `pcie_ep_to_noc_bw_gbs = 256.0` (GB/s), and externally
by the system's `io_ep_to_switch` link BW (ADR-0015 port/wire model). PCIE_EP
itself takes no part in that accounting.

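
As a rough sketch of this split, the snippet below separates the component-side overhead from the wire-side serialization term. The helper names are hypothetical, and the wire formula (bytes divided by bandwidth, with 1 GB/s taken as 1 byte/ns) is an assumed simplification for illustration, not the ADR-0015 implementation.

```python
def resolve_overhead_ns(node_attrs):
    """Return PCIE_EP latency in ns, falling back to the code default 0.0."""
    return float(node_attrs.get("overhead_ns", 0.0))


def wire_serialization_ns(nbytes, bw_gbs):
    """Assumed wire-side cost: 1 GB/s == 1 byte/ns, so ns = bytes / (GB/s)."""
    return nbytes / bw_gbs


# Hypothetical attrs as they might arrive from components.pcie_ep.attrs.
configured = {"overhead_ns": 5.0}

component_ns = resolve_overhead_ns(configured)   # independent of payload size
default_ns = resolve_overhead_ns({})             # code default when unset
wire_ns = wire_serialization_ns(4096, 256.0)     # charged on the wire, not here
```

Under these assumptions a 4096-byte transfer costs 5.0 ns at PCIE_EP regardless of size, and the 16.0 ns serialization term belongs to the `pcie_ep ↔ io_noc` wire.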
### D3. PCIE_EP is direction-aware in topology but direction-blind in code

The builder lays both `switch ↔ pcie_ep` and `pcie_ep ↔ io_noc` edges, so
PCIE_EP serves:

- inbound (host → device): forward Transactions arriving from the switch to
  the io_noc-side next hop.
- outbound (device → host): forward Transactions arriving from io_noc/io_cpu
  back to the switch.

Both are handled by D1's generic forwarding worker; the component code never
distinguishes direction (it just follows `txn.next_hop`).

### D4. PCIE_EP is not Flit-aware (legacy reassembly path)

`_FLIT_AWARE` is left at the inherited `False`, so `_fan_in` reassembles
upstream-chunkified Flits into the parent Transaction before delivery to
`_inbox` (aligned with ADR-0033 Phase 2c incremental rollout).

A future PCIe TLP-level credit model would revisit D4.

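
The reassembly behaviour can be pictured with the sketch below. The `Flit` fields (`parent_id`, `index`, `total`), the `ReassemblingFanIn` class, and the use of the parent id as a stand-in for the parent Transaction are all hypothetical; the real `_fan_in` and Flit schema are defined by ADR-0033 and may differ.

```python
from collections import defaultdict, namedtuple

# Hypothetical Flit shape: which parent it belongs to, its position, and
# how many Flits the parent was split into.
Flit = namedtuple("Flit", ["parent_id", "index", "total"])


class ReassemblingFanIn:
    FLIT_AWARE = False  # mirrors the inherited default described in D4

    def __init__(self):
        self.inbox = []
        self._seen = defaultdict(set)

    def deliver(self, msg):
        # Whole messages (or any message on a Flit-aware component) pass through.
        if self.FLIT_AWARE or not isinstance(msg, Flit):
            self.inbox.append(msg)
            return
        seen = self._seen[msg.parent_id]
        seen.add(msg.index)
        # Only when every Flit has arrived does the parent reach the inbox.
        if len(seen) == msg.total:
            del self._seen[msg.parent_id]
            self.inbox.append(msg.parent_id)


fan_in = ReassemblingFanIn()
fan_in.deliver(Flit("txn-7", 2, 3))
fan_in.deliver(Flit("txn-7", 0, 3))
pending = list(fan_in.inbox)          # still empty: one Flit is missing
fan_in.deliver(Flit("txn-7", 1, 3))
delivered = list(fan_in.inbox)        # parent delivered exactly once
```

The point of the sketch is that the worker only ever sees whole Transactions, so `overhead_ns` is charged once per Transaction rather than once per Flit.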
### D5. PCIE_EP is a **named node** for routing helpers

`policy/routing/router.py` provides `find_pcie_ep(sip, io_id="io0")`,
`find_all_pcie_eps()`, and `find_memory_path(pcie_ep, dst_node)`, all of
which treat PCIE_EP as the start (or end) of the memory path. The component
itself supplies no information to these helpers; the naming convention
(`sip{S}.{io_id}.pcie_ep`) is guaranteed by the topology builder.

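
To make the naming contract concrete, here is a small sketch that builds and recognises node IDs following `sip{S}.{io_id}.pcie_ep`. The helpers `pcie_ep_node_id` and `list_pcie_ep_ids` are hypothetical and are not the router's actual functions, and the sibling node names in the sample list are assumed for illustration.

```python
import re

# Matches the convention from D5: sip<number>.<io_id>.pcie_ep
_PCIE_EP_RE = re.compile(r"^sip(\d+)\.([^.]+)\.pcie_ep$")


def pcie_ep_node_id(sip, io_id="io0"):
    """Build the node ID the topology builder is expected to guarantee."""
    return f"sip{sip}.{io_id}.pcie_ep"


def list_pcie_ep_ids(node_ids):
    """Pick out PCIE_EP nodes purely by name, with no help from the component."""
    return sorted(n for n in node_ids if _PCIE_EP_RE.match(n))


nodes = [
    "fabric.switch0",
    "sip1.io0.pcie_ep",
    "sip0.io0.io_cpu",
    "sip0.io0.pcie_ep",
    "sip0.io0.io_noc",
]
first = pcie_ep_node_id(0)
all_eps = list_pcie_ep_ids(nodes)
```

Because lookup depends only on the string pattern, renaming the component ID would break helpers of this kind, which is the blast radius this decision makes explicit.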
## Alternatives Considered

### A1. Full PCIe TLP-level model (credits, retry, MPS chunking)

Rejected. Violates ADR-0033's "current latency model = abstract overhead + BW
serialization" simplification. Host↔device protocol fidelity is explicitly
out-of-scope in SPEC §5 "Non-Goals".

### A2. Per-PCIE_EP `simpy.Resource` for in-flight cap

Rejected. Host traffic is not a contention bottleneck in current workloads.
Defer to a separate ADR if it becomes one (in which case D1 stays and D2 is
extended).

### A3. Merge PCIE_EP into IO_CPU

Rejected. PCIE_EP is the protocol-boundary node first hit on the host side;
IO_CPU is the device-side control-plane processing node (ADR-0036). Traffic
fan-out and command decoding costs concentrate in IO_CPU, while PCIE_EP only
expresses link-edge overhead. Merging them would mix two responsibilities and
violate the spirit of ADR-0007 (runtime API/sim_engine boundaries).

## Consequences

- PCIE_EP gets an explicit model ADR despite having near-zero code, which keeps
  it consistent with peer component ADRs and lowers maintenance friction.
- A future PCIe-level refinement would supersede this ADR by extending D2/D4
  in a new ADR.
- D5 makes the named-node dependency explicit, so any future renaming of
  component IDs has a clearly bounded blast radius.