ywkang 1f36baa898 ADR: add 0038-0042 (pcie_ep, pe_mmu, pe_tcm, sram, tiling)

Fill component-model coverage gaps surfaced by /report's G4 analysis.
Each ADR documents the component's First action, latency model, and
honest notes on dormant code or implementation asymmetries discovered
during re-evaluation against current code.

- 0038 pcie_ep: thin protocol-overhead model; ComponentBase forwarding
  worker as-is; named-node contract for router helpers
- 0039 pe_mmu: component + utility dual role; sub-page region stopgap;
  D2.1 flags pipeline path missing mmu.overhead_ns timeout (asymmetric
  with non-pipeline; not visible at default tlb_overhead_ns=0)
- 0040 pe_tcm: dual-channel BW serialization (read/write Resource cap=1);
  TcmRequest schema owned by TCM; timing-only (no data store)
- 0041 sram: terminal scratchpad model + ResponseMsg on reverse path;
  D1.1 flags _worker override as currently dormant (no Transaction
  actually targets the SRAM node today)
- 0042 tiling: pure plan-generator module, not a component; corrects
  the G4 misclassification; pins GEMM/Math stage sequences and
  epilogue scope contract

Also: /report skill G3 refinement — only flag older->newer asymmetric
cross-references; newer->older (e.g., 0034-0037 citing infrastructure
ADRs) are expected one-way and no longer reported.

Bilingual pair verifier (tools/verify_adr_lang_pairs.py) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 14:43:03 -07:00

6.9 KiB


ADR-0038: PCIE_EP Component Model

Status

Accepted (2026-05-20).

A component-level ADR in the same vein as ADR-0035 (M_CPU), ADR-0036 (IO_CPU), and ADR-0037 (Forwarding).

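First action

PCIE_EP takes one Transaction from _inbox and calls run() through _forward_txn; inside run() it applies the PCIe protocol-processing delay with env.timeout() for node.attrs["overhead_ns"]. From that point on it follows the forwarding convention defined by the generic ComponentBase worker: if there is a next hop, out_ports[next_hop].put(...); otherwise it consumes drain_ns and calls txn.done.succeed().

In other words, the first and only job of PCIE_EP is to express PCIe protocol overhead as time. It makes no additional decisions such as routing, payload transformation, or MMIO decoding.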

Context

PCIE_EP acts as the bidirectional boundary point between host and device in the topology graph. The builder (topology/builder.py) creates an IO chiplet instance per SIP, places pcie_ep, io_cpu, and io_noc inside it, and then lays bidirectional edges between pcie_ep and the cross-SIP switch on the external host side:

switch → pcie_ep: host → device traffic (MemoryWrite, MemoryRead, KernelLaunch).
pcie_ep → switch: device-side outbound traffic (e.g., cross-SIP IPCQ tokens).

Inside the IOChiplet, a bidirectional pcie_ep ↔ io_noc edge is laid, and the next hop branches to io_cpu or to the cube-side hbm_ctrl path (see the ADR-0036 IO_CPU model). The router and resolver already recognize the contract required by SPEC R7, "PCIE_EP is the endpoint for memory operations," so helpers such as find_pcie_ep(sip) and find_memory_path(pcie_ep, dst_node) take PCIE_EP as their starting point.
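The edge layout described above can be sketched as a small adjacency list. This is an illustration only, not the code in topology/builder.py: the helper names build_io_edges and shortest_path, the bare "switch" node name, and the bidirectional io_noc ↔ io_cpu edge are assumptions made for the example; only the sip{S}.{io_id}.pcie_ep pattern comes from this ADR.

```python
from collections import deque


def build_io_edges(sip: int, io_id: str = "io0") -> dict:
    """Directed adjacency list for one IO chiplet plus the host-side switch (hypothetical helper)."""
    prefix = f"sip{sip}.{io_id}"
    ep, noc, cpu = f"{prefix}.pcie_ep", f"{prefix}.io_noc", f"{prefix}.io_cpu"
    edges = {}
    # Each bidirectional edge is stored as two directed edges.
    # The io_noc <-> io_cpu edge is an assumption for this sketch.
    for a, b in [("switch", ep), (ep, noc), (noc, cpu)]:
        edges.setdefault(a, []).append(b)
        edges.setdefault(b, []).append(a)
    return edges


def shortest_path(edges, src, dst):
    """Plain BFS; returns the node list from src to dst, or None if unreachable."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in edges.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


edges = build_io_edges(0)
inbound = shortest_path(edges, "switch", "sip0.io0.io_cpu")
outbound = shortest_path(edges, "sip0.io0.io_cpu", "switch")
```

In this sketch both the inbound and the outbound path pass through sip0.io0.pcie_ep, which is what makes it the boundary node in either direction.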

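The problem is that all of these dependencies live on the builder/router/resolver side, while no ADR specifies PCIE_EP's own internal model. As a result:

The question "what latency does PCIE_EP model?" can only be answered by reading the code.
PCIE_EP is asymmetric with other components that already have ADRs (IO_CPU = ADR-0036, M_CPU = ADR-0035).
The rationale for deciding whether to refine the PCIe link-layer model in the future (e.g., TLP credit, retry) is scattered.

This ADR pins down the current thin PCIE_EP model explicitly and records that it is an intentional simplification (aligned with the ADR-0033 latency model simplification policy).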

Decision

D1. PCIE_EP uses ComponentBase's generic forwarding worker as-is

PcieEpComponent inherits from ComponentBase and overrides neither _worker nor _forward_txn. Every Transaction is therefore processed in the following order:

_fan_in loads incoming messages (or Transactions reassembled from Flits) into _inbox.
_worker takes one item from _inbox and forks it with env.process(self._forward_txn(env, txn)) (per-message pipelining).
_forward_txn invokes, in order, the op_log start hook → the run() delay → the op_log end hook.
run() is a single line: yield env.timeout(overhead_ns).
If there is a next hop, out_ports[next_hop].put(txn.advance()); if not (the Transaction arrived here as its terminal), it consumes drain_ns and then calls txn.done.succeed().
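The processing order above can be illustrated with a standard-library stand-in for the SimPy environment. This is a sketch under stated assumptions, not ComponentBase itself: MiniEnv, the dict-based transactions, the log tuples, and the drain_ns value of 2.0 are all hypothetical, and the op_log hooks are reduced to list appends. It is meant to show one consequence of per-message forking: transactions that arrive together all finish run() at the same virtual time, because nothing serializes them.

```python
import heapq
from itertools import count


class MiniEnv:
    """Tiny virtual-time scheduler; a process is a generator that yields delays in ns."""

    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._seq = count()  # tie-breaker so heapq never compares generators

    def process(self, gen):
        heapq.heappush(self._queue, (self.now, next(self._seq), gen))

    def run(self):
        while self._queue:
            self.now, _, gen = heapq.heappop(self._queue)
            try:
                delay = next(gen)
            except StopIteration:
                continue
            heapq.heappush(self._queue, (self.now + delay, next(self._seq), gen))


def forward_txn(env, txn, overhead_ns, drain_ns, log):
    log.append(("start", txn["id"], env.now))      # stands in for the op_log start hook
    yield overhead_ns                              # run(): the only modeled delay
    log.append(("end", txn["id"], env.now))        # stands in for the op_log end hook
    if txn["hop"] + 1 < len(txn["path"]):
        txn["hop"] += 1                            # stands in for txn.advance()
        log.append(("forward", txn["id"], txn["path"][txn["hop"]]))
    else:
        yield drain_ns                             # terminal arrival: consume drain_ns
        log.append(("done", txn["id"], env.now))   # stands in for txn.done.succeed()


def worker(env, inbox, overhead_ns, drain_ns, log):
    # Fork one process per message instead of handling them one at a time.
    for txn in inbox:
        env.process(forward_txn(env, txn, overhead_ns, drain_ns, log))


env, log = MiniEnv(), []
inbox = [
    {"id": 0, "path": ["switch", "pcie_ep", "io_noc"], "hop": 1},
    {"id": 1, "path": ["switch", "pcie_ep", "io_noc"], "hop": 1},
    {"id": 2, "path": ["switch", "pcie_ep"], "hop": 1},  # pcie_ep is terminal here
]
worker(env, inbox, overhead_ns=5.0, drain_ns=2.0, log=log)
env.run()
```

With these inputs all three "end" entries carry the same timestamp, and only the terminal transaction pays the extra drain delay.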

D2. `overhead_ns` is PCIE_EP's only timing model

Only node.attrs["overhead_ns"] is recognized as a latency parameter. The code default is 0.0, and components.pcie_ep.attrs of the IOChiplet in topology.yaml specifies the actual value (current topology: overhead_ns: 5.0 ns).
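The default-versus-configured behavior can be sketched as a single dictionary lookup. The function name resolve_overhead_ns and the nested-dict shape of the topology fragment are assumptions for illustration; the actual topology.yaml layout may differ.

```python
def resolve_overhead_ns(attrs: dict) -> float:
    """Return the configured overhead in ns, falling back to the code default of 0.0."""
    return float(attrs.get("overhead_ns", 0.0))


# Hypothetical in-memory form of the IOChiplet fragment of topology.yaml.
io_chiplet = {"components": {"pcie_ep": {"attrs": {"overhead_ns": 5.0}}}}

configured = resolve_overhead_ns(io_chiplet["components"]["pcie_ep"]["attrs"])
unconfigured = resolve_overhead_ns({})
```

A node with no overhead_ns attribute therefore contributes zero delay, which is worth remembering when a topology omits the attribute.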

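There is no separate BW serialization resource (simpy.Resource), queue depth, or retry model. Link-level BW serialization is handled on the wire side: inside the IOChiplet the link is pcie_ep_to_noc_bw_gbs = 256.0 GB/s, and outside it the system's io_ep_to_switch link BW applies (ADR-0015 port/wire model). The PCIE_EP component itself takes no part in this BW accounting.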
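As a worked example of how the two contributions separate, the sketch below adds the component overhead to a wire-side serialization term. It assumes the wire charges payload size divided by link bandwidth and that GB means 10^9 bytes (so 1 GB/s equals 1 byte/ns); the exact accounting is defined by the ADR-0015 port/wire model and is not reproduced here. The function names and the 4096-byte payload are hypothetical.

```python
def wire_serialization_ns(nbytes: int, bw_gbs: float) -> float:
    """Serialization time on a link, assuming 1 GB/s == 1 byte/ns (GB = 10**9 bytes)."""
    return nbytes / bw_gbs


def pcie_ep_hop_ns(nbytes: int, overhead_ns: float = 5.0, link_bw_gbs: float = 256.0) -> float:
    """Component overhead plus the wire-side term for the pcie_ep -> io_noc link."""
    return overhead_ns + wire_serialization_ns(nbytes, link_bw_gbs)


wire_ns = wire_serialization_ns(4096, 256.0)  # 4096 B over 256 GB/s
total_ns = pcie_ep_hop_ns(4096)               # overhead_ns + wire term
```

Only the overhead_ns term belongs to PCIE_EP; the size-dependent term is charged by the link, which is why the component needs no Resource of its own.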

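D3. PCIE_EP is aware of bidirectional use but does not change behavior by direction

The topology builder lays bidirectional switch ↔ pcie_ep and pcie_ep ↔ io_noc edges. PCIE_EP therefore handles:

inbound (host→device): forwards a Transaction that arrived from the switch toward io_noc, via next-hop computation.
outbound (device→host): forwards a Transaction that arrived from io_noc/io_cpu toward the switch.

Both cases are handled by the generic forwarding worker of D1, and the component code itself does not distinguish direction (it only follows txn.next_hop).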
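A minimal sketch of direction-agnostic forwarding follows. The Txn dataclass, its path/hop fields, and the list-backed out_ports are assumptions standing in for the real Transaction and port types; the point is only that the same forward() call serves both directions because it consults nothing but next_hop.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Txn:
    """Hypothetical transaction: a precomputed path plus the index of the current node."""
    path: tuple
    hop: int = 0

    @property
    def next_hop(self) -> Optional[str]:
        nxt = self.hop + 1
        return self.path[nxt] if nxt < len(self.path) else None

    def advance(self) -> "Txn":
        return replace(self, hop=self.hop + 1)


def forward(txn: Txn, out_ports: dict) -> Optional[str]:
    """Forward along next_hop only; no branch depends on inbound vs. outbound."""
    target = txn.next_hop
    if target is None:
        return None  # terminal arrival; the caller would drain and complete
    out_ports[target].append(txn.advance())
    return target


out_ports = {"io_noc": [], "switch": []}
inbound = Txn(path=("switch", "pcie_ep", "io_noc"), hop=1)
outbound = Txn(path=("io_noc", "pcie_ep", "switch"), hop=1)
inbound_target = forward(inbound, out_ports)
outbound_target = forward(outbound, out_ports)
```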

D4. PCIE_EP is not Flit-aware (legacy reassembly path)

_FLIT_AWARE is not set to True. _fan_in therefore reassembles Flits that were chunkified upstream into their parent Transaction and puts it in _inbox (aligned with the ADR-0033 Phase 2c incremental rollout policy).
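The reassembly behavior can be sketched as follows. The Flit fields (txn_id, index, total) and the FanIn class are assumptions for illustration, since this ADR does not describe the real Flit schema; the sketch only shows that a non-Flit-aware component sees one inbox entry per parent Transaction, delivered once its last Flit has arrived.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flit:
    """Hypothetical flit: which parent it belongs to and its position among `total` chunks."""
    txn_id: int
    index: int
    total: int


class FanIn:
    """Collects flits and releases the parent transaction id once all chunks are present."""

    def __init__(self):
        self.inbox = []
        self._pending = {}  # txn_id -> set of flit indices seen so far

    def put(self, flit: Flit) -> None:
        seen = self._pending.setdefault(flit.txn_id, set())
        seen.add(flit.index)
        if len(seen) == flit.total:
            del self._pending[flit.txn_id]
            self.inbox.append(flit.txn_id)


fan_in = FanIn()
# Flits of two transactions arrive interleaved; txn 7 completes before txn 3.
for f in [Flit(3, 0, 3), Flit(7, 0, 2), Flit(3, 1, 3), Flit(7, 1, 2), Flit(3, 2, 3)]:
    fan_in.put(f)
```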

D4 will be re-evaluated if PCIE_EP is ever extended with a PCIe TLP-level credit model.

D5. PCIE_EP is a named node for the routing helpers

find_pcie_ep(sip, io_id="io0"), find_all_pcie_eps(), and find_memory_path(pcie_ep, dst_node) in policy/routing/router.py treat PCIE_EP as the start (or end) of a memory path. The component itself provides no information to these helpers; the naming rule (sip{S}.{io_id}.pcie_ep) is guaranteed by the topology builder.
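To make the naming contract concrete, the sketch below resolves PCIE_EP nodes purely from their IDs. These functions are not the helpers in policy/routing/router.py: the _sketch suffix marks them as hypothetical, and the explicit nodes argument is an assumption, since the real helpers' access to the graph is not described here. Only the sip{S}.{io_id}.pcie_ep pattern is taken from this ADR.

```python
def pcie_ep_name(sip: int, io_id: str = "io0") -> str:
    """Node ID following the sip{S}.{io_id}.pcie_ep naming rule."""
    return f"sip{sip}.{io_id}.pcie_ep"


def find_pcie_ep_sketch(nodes: set, sip: int, io_id: str = "io0") -> str:
    """Look up one PCIE_EP by name; fails loudly if the builder did not create it."""
    name = pcie_ep_name(sip, io_id)
    if name not in nodes:
        raise KeyError(f"no PCIE_EP node named {name!r}")
    return name


def find_all_pcie_eps_sketch(nodes: set) -> list:
    """All PCIE_EP nodes, identified only by the trailing name component."""
    return sorted(n for n in nodes if n.split(".")[-1] == "pcie_ep")


nodes = {"sip0.io0.pcie_ep", "sip0.io0.io_noc", "sip0.io0.io_cpu", "sip1.io0.pcie_ep"}
first = find_pcie_ep_sketch(nodes, 0)
all_eps = find_all_pcie_eps_sketch(nodes)
```

Because the lookup depends on nothing but the string pattern, renaming the component in the builder would silently break every such helper, which is the dependency D5 records.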

Alternatives Considered

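A1. PCIe TLP-level model (credit, retry, MPS splitting)

Rejected. It conflicts with the simplification principle stated in ADR-0033: "the current latency model is expressed as abstract overhead + BW serialization." Host↔device protocol fidelity is intentionally out of scope per SPEC §5 "Non-Goals".

A2. Limiting in-flight Transactions in PCIE_EP with its own simpy.Resource

Rejected. Host traffic is not a contention bottleneck in current workloads. The limit will be introduced through a separate ADR when it becomes necessary (for compatibility, D1 stays as-is and D2 is extended).

A3. Merging PCIE_EP into IO_CPU

Rejected. PCIE_EP is the protocol boundary node first encountered from the host side, whereas IO_CPU is the device-side control-plane processing node (ADR-0036). Decision costs such as traffic fan-out and command decoding are concentrated in IO_CPU, and it is meaningful for PCIE_EP to express only the link-edge overhead. Merging them would mix the two responsibilities and go against the spirit of ADR-0007 (runtime API/sim_engine boundary).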

Consequences

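PCIE_EP gets an explicit model ADR while having almost zero lines of code: consistency goes up and maintenance cost goes down.
If PCIe-level refinement becomes necessary, a new ADR that extends D2/D4 will supersede this one.
Because D5 states that router helpers such as find_memory_path depend on PCIE_EP as a named node, the scope of impact of any change to the component ID naming rule is clear.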
