adr: add ADR-0050-0053 — close /report's second-pass G4 candidates

Documents four cross-cutting surfaces one layer deeper than the prior G4 batch:

- 0050 par-ccl-algorithm-module-contract: how to author a new CCL algorithm in src/kernbench/ccl/algorithms/. Pairs with ADR-0045's bench-module contract. Pins the four required public symbols (kernel, kernel_args, TOPO_NAME_TO_KIND constants, kernel alias), the 9 + tl standardized kernel signature, the kernel_args tuple format, sip_topo_kind dispatch, and the ccl.yaml entry workflow.
- 0051 lat-routing-helper-api: every public method of AddressResolver (resolve, find_m_cpu, find_pcie_ep, find_io_cpu, find_all_pcie_eps) and PathRouter (find_path, find_path_with_distance, find_mcpu_dma_path, find_memory_path, find_node_path + 2 shims). Pins the four adjacency graphs (_adj_all / _adj / _adj_mcpu_dma / _adj_local) and the edge-kind exclusion sets they use, plus the single-owner naming convention.
- 0052 dev-oplog-memory-store-schemas: OpRecord's 7 fields, the per-op_name params matrix (dma_read, dma_write, gemm_*, math, math reduction, composite_gemm, ipcq_copy, unknown), snapshot timing rules (math = all inputs, dma_write = HBM-only, per ADR-0027 race avoidance), TileToken stage_type capture, and MemoryStore's (space, addr) two-level dict with reference-store semantics.
- 0053 dev-topology-builder-algorithms: the 6-stage compile pipeline, cube_mesh.yaml's source_hash cache and its 5 input fields, the cube NoC auto-layout algorithm (row/col placement, HBM exclusion zone, PE/M_CPU/SRAM attachment via nearest-router, UCIe N/S/E/W distribution), the node naming convention (single-owner with router.py), the edge-kind catalog, the 4 view projections, and a table of spec-field changes vs mesh regeneration.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:52:42 -07:00
parent 9a02955770
commit bd49c93703
8 changed files with 2566 additions and 0 deletions
@@ -0,0 +1,267 @@
+# ADR-0051: Routing Helper API — `AddressResolver` + `PathRouter`
+
+## Status
+
+Accepted (2026-05-22).
+
+This ADR specifies every public API, argument, and return value of the
+two helper classes exposed by `policy/routing/router.py`
+(`AddressResolver`, `PathRouter`), and where each of the four adjacency
+graphs is used. ADR-0002 defines routing distance, ordering, and bypass
+rules, but the **helper API surface itself** has never been documented at
+the ADR level.
+
+## First action
+
+### `AddressResolver(graph)`
+
+On construction it immediately caches two things:
+
+1. `self._node_ids = set(graph.nodes)`: the set of all node ids (for lookup).
+2. `self._hbm_slice_bytes = hbm_total_gb * (1 << 30) // slices_per_cube`:
+   derived from `graph.spec.cube.memory_map` (default `48 GB / 8 slices = 6
+   GB`). `resolve()` uses this value to recover `pe_id` from the
+   `hbm_offset` of an HBM PA.
+
+In short, **the first thing AddressResolver does is precompute the full
+set of node ids and the HBM slice size**. It does not retain the graph
+itself.
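The slice arithmetic above can be checked with a small sketch. The function names `hbm_slice_bytes` and `pe_id_from_offset` are hypothetical and exist only for this illustration; the formula and the 48 GB / 8 slices default come from the ADR text.

```python
GIB = 1 << 30  # bytes per GiB, matching the (1 << 30) factor in the ADR


def hbm_slice_bytes(hbm_total_gb: int, slices_per_cube: int) -> int:
    """Bytes per HBM slice, using the formula quoted in the ADR."""
    return hbm_total_gb * GIB // slices_per_cube


def pe_id_from_offset(hbm_offset: int, slice_bytes: int) -> int:
    """Recover the owning PE index from an offset into the cube's HBM."""
    return hbm_offset // slice_bytes


# Default memory map from the ADR: 48 GB split into 8 slices.
slice_bytes = hbm_slice_bytes(48, 8)
first_pe = pe_id_from_offset(0, slice_bytes)
last_pe = pe_id_from_offset(48 * GIB - 1, slice_bytes)
```

With the default memory map, every offset in the first 6 GiB maps to PE 0 and the last byte of the 48 GiB range maps to PE 7.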
+
+### `PathRouter(graph)`
+
+On construction it immediately **builds four separate adjacency graphs
+in one pass**:
+
+1. `self._adj_all`: includes every edge (for component-to-component routing).
+2. `self._adj`: only edges with `kind != "command"` (PE DMA / general data path).
+3. `self._adj_mcpu_dma`: excludes `_MCPU_DMA_EXCLUDE = {"pe_internal",
+   "pe_to_router"}` (so that M_CPU DMA is not misrouted through PE
+   pipeline nodes).
+4. `self._adj_local`: excludes the eight `_UCIE_KINDS` (for cube-local
+   routing; it stops Dijkstra from preferring UCIe over the mesh because
+   UCIe looks like a zero-distance bus).
+
+Each graph is a `defaultdict(list)` mapping a node id to a list of
+`(neighbor, weight)` pairs, with `edge.routing_weight_mm or
+edge.distance_mm` used as the weight.
+
+In short, **the first thing PathRouter does is classify the topology
+edges under four different policies at once and build four adjacency
+lists**. Each `find_*()` call then picks the appropriate graph and runs
+Dijkstra on it.
+
+## Context
+
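+`policy/routing/router.py` carries two responsibilities together:
+
+- **Name mapping**: it is the single owner of the topology naming
+  convention (`sip{S}.cube{C}.<comp>`, `sip{S}.io{I}.pcie_ep`, and so on).
+  Components, probes, IPCQ install, and the runtime API call the helper
+  instead of building name strings themselves.
+- **Path selection**: policy is separated by edge `kind`. Even for the
+  same src→dst pair, the routing intent (PE DMA vs M_CPU DMA vs general
+  component routing) requires a different adjacency, and the result
+  differs accordingly.
+
+Although this helper API is consumed widely across the code base
+(probe.py / distributed.py / install.py / various components / tests),
+**the exact signatures, return semantics, and which adjacency each
+method uses** have not been collected in one place at the ADR level. This
+ADR fills that gap.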
+
+## Decision
+
+### D1. The five public APIs of `AddressResolver`
+
+#### D1.1. `resolve(addr: PhysAddr) -> str`
+
+Converts a `PhysAddr` instance into the topology's destination node id.
+
+```
+addr.kind == "hbm"             → f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
+  where pe_id = addr.hbm_offset // self._hbm_slice_bytes  (ADR-0017 D4/D9)
+
+addr.kind == "pe_resource":
+  addr.unit_type == PE         → f"sip{s}.cube{d}.pe{addr.pe_id}.pe_tcm"
+  addr.unit_type == SRAM       → f"sip{s}.cube{d}.sram"
+  addr.unit_type == MCPU       → f"sip{s}.cube{d}.m_cpu"
+  otherwise                    → RoutingError("unsupported unit_type")
+
+any other kind                 → RoutingError("unsupported address kind")
+```
+
+If the computed node id is not in `self._node_ids`, it raises
+`RoutingError(f"node {node_id} not found in topology")`. That is, even a
+syntactically valid address fails loudly when no topology node actually
+maps to it.
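The mapping in D1.1 can be sketched as a free function. This is a minimal illustration, not the code in router.py: the `PhysAddr` field names (`sip`, `cube`, `unit_type` as a string) and the local `RoutingError` class are assumptions made so the example is self-contained.

```python
from dataclasses import dataclass


class RoutingError(Exception):
    """Stand-in for the project's RoutingError (assumed name only)."""


@dataclass(frozen=True)
class PhysAddr:
    """Hypothetical field layout, for illustration only."""
    sip: int
    cube: int
    kind: str             # "hbm" or "pe_resource"
    hbm_offset: int = 0   # used when kind == "hbm"
    unit_type: str = ""   # "PE", "SRAM", or "MCPU" when kind == "pe_resource"
    pe_id: int = 0


def resolve(addr: PhysAddr, node_ids: set, hbm_slice_bytes: int) -> str:
    prefix = f"sip{addr.sip}.cube{addr.cube}"
    if addr.kind == "hbm":
        # The PE index is recovered from the offset into the cube's HBM.
        node_id = f"{prefix}.hbm_ctrl.pe{addr.hbm_offset // hbm_slice_bytes}"
    elif addr.kind == "pe_resource":
        if addr.unit_type == "PE":
            node_id = f"{prefix}.pe{addr.pe_id}.pe_tcm"
        elif addr.unit_type == "SRAM":
            node_id = f"{prefix}.sram"
        elif addr.unit_type == "MCPU":
            node_id = f"{prefix}.m_cpu"
        else:
            raise RoutingError("unsupported unit_type")
    else:
        raise RoutingError("unsupported address kind")
    # Fail loudly when the name is well-formed but absent from the topology.
    if node_id not in node_ids:
        raise RoutingError(f"node {node_id} not found in topology")
    return node_id
```

The final membership check is what distinguishes a syntactically valid address from one that actually exists in the built topology.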
+
+#### D1.2. `find_m_cpu(sip, cube) -> str`
+
+`f"sip{sip}.cube{cube}.m_cpu"`. Raises `RoutingError` if the node does not exist.
+
+#### D1.3. `find_pcie_ep(sip, io_id="io0") -> str`
+
+`f"sip{sip}.{io_id}.pcie_ep"`. Raises `RoutingError` if the node does not exist.
+
+#### D1.4. `find_io_cpu(sip, io_id="io0") -> str`
+
+`f"sip{sip}.{io_id}.io_cpu"`. Raises `RoutingError` if the node does not exist.
+
+#### D1.5. `find_all_pcie_eps() -> list[str]`
+
+Returns the PCIE_EP node ids of all SIPs as a sorted list, filtered with
+`endswith(".pcie_ep")`. Used when cross-SIP IPCQ enumerates every PCIE_EP.
+
+This class is the single owner of the naming convention
+(`sip{S}.cube{C}.<comp>`, `sip{S}.{io_id}.<comp>`) (ADR-0015 D4). The
+topology builder creates nodes with the same convention, and components
+never construct name strings directly; they always go through the helper.
+
+### D2. The four adjacency graphs of `PathRouter`
+
+The constructor builds them all at once. The edge `kind` determines the policy:
+
+| graph             | excluded edge kinds                           | purpose                                    |
+|-------------------|-----------------------------------------------|--------------------------------------------|
+| `_adj_all`        | (none)                                        | includes M_CPU↔NOC command edges; IO_CPU/M_CPU routing |
+| `_adj`            | `"command"`                                   | PE DMA / general data path                 |
+| `_adj_mcpu_dma`   | `"pe_internal"`, `"pe_to_router"`            | M_CPU DMA (bypasses the PE pipeline)       |
+| `_adj_local`      | `_UCIE_KINDS` (`ucie_internal`, `ucie_conn_to_router`, `router_to_ucie_conn`, `ucie_conn_to_noc`, `noc_to_ucie_conn`, `ucie_mesh`, `io_to_cube`, `cube_to_io`) | same-cube routing (bypasses the UCIe bus) |
+
+Each graph is a `dict[node_id, list[(neighbor, weight)]]`, and the weight
+is `edge.routing_weight_mm or edge.distance_mm`. The split makes the
+routing influence of command edges explicit. The separate `_adj_local`
+graph, which keeps the UCIe "0-distance bus" from being preferred over
+the mesh, is consistent with the cross-PE-slice mesh-distance requirement
+of ADR-0017 D7.
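A minimal sketch of the one-pass, four-graph build described in D2. The `Edge` dataclass, its `src`/`dst` field names, the assumption that edges are directed, and the function name `build_adjacency` are all hypothetical; the exclusion sets and the weight expression are taken from the table and text above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge record; field names are assumptions."""
    src: str
    dst: str
    kind: str
    distance_mm: float
    routing_weight_mm: Optional[float] = None


MCPU_DMA_EXCLUDE = frozenset({"pe_internal", "pe_to_router"})
UCIE_KINDS = frozenset({
    "ucie_internal", "ucie_conn_to_router", "router_to_ucie_conn",
    "ucie_conn_to_noc", "noc_to_ucie_conn", "ucie_mesh",
    "io_to_cube", "cube_to_io",
})


def build_adjacency(edges):
    """Classify every edge once and fill the four adjacency lists."""
    adj_all, adj, adj_mcpu_dma, adj_local = (defaultdict(list) for _ in range(4))
    for e in edges:
        # Same expression as the ADR: prefer routing_weight_mm when truthy.
        item = (e.dst, e.routing_weight_mm or e.distance_mm)
        adj_all[e.src].append(item)
        if e.kind != "command":
            adj[e.src].append(item)
        if e.kind not in MCPU_DMA_EXCLUDE:
            adj_mcpu_dma[e.src].append(item)
        if e.kind not in UCIE_KINDS:
            adj_local[e.src].append(item)
    return adj_all, adj, adj_mcpu_dma, adj_local
```

One property of the `or` expression worth keeping in mind: a `routing_weight_mm` of `0` or `None` is falsy in Python, so both fall back to `distance_mm`.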
+
+### D3. The five public APIs of `PathRouter` (+ 2 backward-compat shims)
+
+#### D3.1. `find_path(src_pe: str, dst_node: str) -> list[str]`
+
+**PE DMA routing**. `src_pe` is a PE prefix (e.g. `"sip0.cube0.pe0"`); the
+function automatically appends `.pe_dma`, so the actual start node becomes
+`"sip0.cube0.pe0.pe_dma"`.
+
+The adjacency is chosen by whether the route is cube-local (`_same_cube`):
+
+- **same-cube** (src and dst share the `sip{S}.cube{C}.` prefix): uses
+  `_adj_local`. Blocking the UCIe shortcut makes cross-PE-slice traffic
+  pay the true mesh distance (ADR-0017 D7).
+- **cross-cube**: uses `_adj`. UCIe is then naturally included as the
+  best option for the cross-cube path.
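The same-cube versus cross-cube branch can be sketched as follows. The helper names `cube_prefix`, `same_cube`, and `select_adjacency` are hypothetical; only the `.pe_dma` suffix and the choice between `_adj_local` and `_adj` come from D3.1.

```python
from typing import Optional


def cube_prefix(node_id: str) -> Optional[str]:
    """Return 'sip{S}.cube{C}' for cube-scoped node ids, else None."""
    parts = node_id.split(".")
    if len(parts) >= 2 and parts[0].startswith("sip") and parts[1].startswith("cube"):
        return ".".join(parts[:2])
    return None


def same_cube(a: str, b: str) -> bool:
    """True only when both ids are cube-scoped and share SIP and cube."""
    prefix = cube_prefix(a)
    return prefix is not None and prefix == cube_prefix(b)


def select_adjacency(src_pe: str, dst_node: str, adj_local: dict, adj: dict):
    """Return (start_node, adjacency) following the D3.1 branch."""
    start = f"{src_pe}.pe_dma"  # the PE prefix gets the .pe_dma suffix
    return start, (adj_local if same_cube(start, dst_node) else adj)
```

Comparing whole dot-separated components avoids a false match between `cube1` and `cube10`, which a bare string-prefix test without the trailing dot would confuse.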
+
+#### D3.2. `find_path_with_distance(src_pe, dst_node) -> tuple[list[str], float]`
+
+Uses the same adjacency policy as D3.1 but also returns the distance, as
+`(path, total_distance)`. Used by probes and analysis tools that need the
+distance metric.
+
+#### D3.3. `find_mcpu_dma_path(m_cpu_id: str, dst_hbm_id: str) -> list[str]`
+
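+**M_CPU DMA path**. If the cubes match it uses `_adj_local` (the route
+stays inside the mesh); otherwise it uses `_adj_all` (via UCIe). Because
+`_MCPU_DMA_EXCLUDE` automatically excludes PE pipeline nodes, no invalid
+route is produced in which M_CPU traffic passes through a PE's internal
+stages.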
+
+#### D3.4. `find_memory_path(src: str, dst: str) -> list[str]`
+
+A direct memory path such as `pcie_ep → io_noc → cube → router mesh →
+hbm_ctrl`. It uses `_adj_mcpu_dma`, which excludes `pe_internal` and
+`pe_to_router` edges, guaranteeing that host-issued reads/writes do not
+leak into the PE pipeline. Called directly by the probe (the H2D/D2H
+cases of ADR-0049 D1).
+
+#### D3.5. `find_node_path(src: str, dst: str) -> list[str]`
+
+A path between two arbitrary nodes, **including command edges** (uses
+`_adj_all`). Called by IoCpuComponent, MCpuComponent, and others that
+must traverse command-kind links such as M_CPU ↔ NOC.
+
+#### D3.6. backward-compat shims
+
+- `_dijkstra(start, goal) -> list[str]`: a thin wrapper around
+  `_run_dijkstra(self._adj, …)`.
+- `_dijkstra_with_dist(start, goal) -> tuple[list[str], float]`: the
+  variant that also returns the distance.
+
+The underscore prefix suggests an internal API, but existing tests call
+these directly. New code should use D3.1–D3.5; the two shims are
+deprecation candidates.
+
+### D4. Dijkstra algorithm: single-source shortest path
+
+`_run_dijkstra_with_dist(adj, start, goal)`:
+
+- `heapq` priority queue.
+- `best: dict[node, distance]`: per-node shortest-distance cache.
+- `prev: dict[node, predecessor]`: used for path reconstruction.
+- The weight is `routing_weight_mm or distance_mm`. Some edges, such as
+  UCIe, specify a routing_weight that differs from their distance, so
+  keeping the weight separate is intentional.
+
+`start == goal` takes a fast path and returns `([start], 0.0)`. An
+unreachable goal raises `RoutingError(f"no path from {start} to {goal}")`.
+
+The algorithm is **deterministic**: the same graph and start/goal yield
+the same path. This is consistent with the SPEC R1 requirement "Routing
+MUST be deterministic". Tie-breaking follows the `heapq` push order
+(Python list order is deterministic).
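A minimal sketch of the D4 search using `heapq`, `best`, and `prev`. It is not the code in router.py: the function name, the local `RoutingError` class, and in particular the sequence counter in each heap entry are assumptions. The counter is one common way to make equal-distance ties resolve in push order explicitly; with plain `(distance, node)` tuples, `heapq` would instead break ties by comparing node ids.

```python
import heapq
from itertools import count


class RoutingError(Exception):
    """Stand-in for the project's RoutingError (assumed name only)."""


def run_dijkstra_with_dist(adj, start, goal):
    """Return (path, total_distance) over adj[node] -> [(neighbor, weight)]."""
    if start == goal:
        return [start], 0.0  # fast path described in D4
    best = {start: 0.0}      # shortest known distance per node
    prev = {}                # predecessor map for path reconstruction
    seq = count()            # push counter: equal distances pop in push order
    heap = [(0.0, next(seq), start)]
    while heap:
        dist, _, node = heapq.heappop(heap)
        if node == goal:
            break
        if dist > best.get(node, float("inf")):
            continue         # stale heap entry, a shorter route was found
        for neighbor, weight in adj.get(node, ()):
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, next(seq), neighbor))
    if goal not in best:
        raise RoutingError(f"no path from {start} to {goal}")
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    path.reverse()
    return path, best[goal]
```

Because the adjacency lists are built in a fixed order and the counter is monotonic, repeated calls on the same graph visit nodes in the same order and return the same path.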
+
+### D5. Single-owner principle for the helper API
+
+The following are decided only inside router.py:
+
+- Naming convention: `sip{S}.cube{C}.<comp>`, `sip{S}.{io_id}.<comp>`,
+  `sip{S}.cube{C}.hbm_ctrl.pe{pe_id}`.
+- Adjacency policy: which edge kinds are included in which graph.
+- How to recover the PE id from the HBM slice size.
+- Dijkstra's weight choice (`routing_weight_mm or distance_mm`).
+
+If this single-owner principle breaks (for example, components start
+building `f"sip{s}..."` strings themselves), the blast radius of any
+naming-convention change explodes. This follows the spirit of ADR-0015 D4.
+
+### D6. Consumers of the helper API
+
+Call sites of the methods exposed by this helper (as of the current code
+base):
+
+- `probes/probe.py` (ADR-0049): `find_pcie_ep`, `find_io_cpu`,
+  `find_m_cpu`, `find_node_path`, `find_mcpu_dma_path`,
+  `find_memory_path`, `find_path`, `resolve`.
+- `runtime_api/distributed.py` (ADR-0047): indirect (routing inside the
+  engine).
+- `ccl/install.py` (ADR-0023): `find_all_pcie_eps`, `resolve`.
+- `sim_engine/event_log.py`: like the probe, `find_pcie_ep`,
+  `find_memory_path`.
+- `components/builtin/m_cpu.py`, `components/builtin/io_cpu.py`:
+  `find_node_path`, `find_mcpu_dma_path`.
+- Various tests (test_routing.py, test_cross_sip_routing.py, etc.): most
+  of D3.1–D3.5.
+
+When a new consumer is added, D1/D3 of this ADR are the first reference
+for judging whether a method matching its intent already exists or a new
+one must be added.
+
+## Alternatives Considered
+
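+### A1. A single adjacency graph with a dynamically applied edge-kind filter
+
+Rejected. Re-filtering the graph on every `find_*()` call hurts
+Dijkstra's cache locality and performance. Building four graphs at once
+(D2) has a small memory cost (edges number in the tens of thousands at
+most), and the policy choice at call time is O(1).
+
+### A2. Separate adjacency by dedicated metadata instead of edge `kind`
+
+Rejected. The edge `kind` is already assigned by the topology builder
+(ADR-0015 D4 + ADR-0017); introducing separate metadata would create
+duplication, with two systems that must be kept in sync.
+
+### A3. BFS with uniform weights instead of Dijkstra
+
+Rejected. Since routing_weight_mm differs per edge (mesh link / UCIe /
+IO-internal), BFS only minimizes hop count and does not guarantee the
+shortest latency or distance. This conflicts with the deterministic and
+accurate routing requirements of SPEC R1 + R2.
+
+### A4. Helper API as module functions instead of class methods
+
+Rejected. The two classes (`AddressResolver`, `PathRouter`) each need to
+hold cached state (`_node_ids`, `_hbm_slice_bytes`, 4 adjacency graphs),
+and many routing queries are issued against the same graph instance.
+Module functions would have to rebuild the state on every call or keep it
+global, degrading safety and performance.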
+
+## Consequences
+
+- If components, probes, IPCQ install, and the runtime API all call only
+  the helpers in router.py, a naming-convention change (e.g. `.io0.` →
+  `.iochiplet0.`) is confined to a single file (D5).
+- The four-graph split of D2 is fixed in an ADR, giving a clear criterion
+  for deciding which graphs should include a newly added edge kind (e.g.
+  a new kind for an inter-die UCIe link).
+- The cube-local vs cross-cube branch of D3.1 (ADR-0017 D7) is made
+  explicit, so anyone changing routing behavior later knows which
+  adjacency to touch.
+- The consumer list in D6 is made explicit, so the PR review scope of a
+  helper API change is clear. The backward-compat shims (D3.6) are
+  identified as deprecation candidates.