kernbench2/docs/adr/ADR-0051-lat-routing-helper-api.md
ywkang bd49c93703 adr: add ADR-0050-0053 — close /report's second-pass G4 candidates
Documents four cross-cutting surfaces one layer deeper than the prior
G4 batch:

- 0050 par-ccl-algorithm-module-contract: how to author a new CCL
  algorithm in src/kernbench/ccl/algorithms/. Pairs with ADR-0045's
  bench-module contract. Pins the four required public symbols
  (kernel, kernel_args, TOPO_NAME_TO_KIND constants, kernel alias),
  the 9 + tl standardized kernel signature, the kernel_args tuple
  format, sip_topo_kind dispatch, and the ccl.yaml entry workflow.

- 0051 lat-routing-helper-api: every public method of AddressResolver
  (resolve, find_m_cpu, find_pcie_ep, find_io_cpu, find_all_pcie_eps)
  and PathRouter (find_path, find_path_with_distance,
  find_mcpu_dma_path, find_memory_path, find_node_path + 2 shims).
  Pins the four adjacency graphs (_adj_all / _adj / _adj_mcpu_dma /
  _adj_local) and the edge-kind exclusion sets they use, plus the
  single-owner naming convention.

- 0052 dev-oplog-memory-store-schemas: OpRecord's 7 fields, the
  per-op_name params matrix (dma_read, dma_write, gemm_*, math, math
  reduction, composite_gemm, ipcq_copy, unknown), snapshot timing
  rules (math = all inputs, dma_write = HBM-only — ADR-0027 race
  avoidance), TileToken stage_type capture, and MemoryStore's
  (space, addr) two-level dict with reference-store semantics.

- 0053 dev-topology-builder-algorithms: the 6-stage compile pipeline,
  cube_mesh.yaml's source_hash cache and its 5 input fields, the
  cube NoC auto-layout algorithm (row/col placement, HBM exclusion
  zone, PE/M_CPU/SRAM attachment via nearest-router, UCIe N/S/E/W
  distribution), the node naming convention (single-owner with
  router.py), the edge-kind catalog, the 4 view projections, and a
  table of spec-field changes vs mesh regeneration.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:52:42 -07:00


ADR-0051: Routing Helper API — AddressResolver + PathRouter

Status

Accepted (2026-05-22).

Pins down every public API, argument, return value, and adjacency-graph selection of the two helper classes (AddressResolver, PathRouter) exposed by policy/routing/router.py. ADR-0002 defines routing distance, ordering, and bypass rules, but the helper API surface itself has had no ADR-level coverage.

First action

AddressResolver(graph)

On construction, caches two pieces of state:

  1. self._node_ids = set(graph.nodes) — a set of all node ids for lookup.
  2. self._hbm_slice_bytes = hbm_total_gb * (1 << 30) // slices_per_cube — derived from graph.spec.cube.memory_map (default 48 GB / 8 slices = 6 GB). resolve() uses this value to decode pe_id from an HBM PA's hbm_offset.

In short, AddressResolver's first act is "precompute the full set of node ids and the HBM slice size". It does not retain the graph itself.
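The slice-size arithmetic, and the pe_id decode it enables, can be sketched as below. This is a minimal illustration rather than the code in router.py: the function names hbm_slice_bytes and decode_pe_id are hypothetical, and only the formula and the 48 GB / 8 slices defaults come from the text above.

```python
def hbm_slice_bytes(hbm_total_gb: int = 48, slices_per_cube: int = 8) -> int:
    """Bytes per HBM slice, using the formula quoted above."""
    return hbm_total_gb * (1 << 30) // slices_per_cube


def decode_pe_id(hbm_offset: int, slice_bytes: int) -> int:
    """Recover the owning PE index from an offset into the cube's HBM."""
    return hbm_offset // slice_bytes


slice_bytes = hbm_slice_bytes()
print(slice_bytes)                                        # 6442450944 (6 GiB)
print(decode_pe_id(slice_bytes * 3 + 4096, slice_bytes))  # 3
```

Integer floor division means every offset inside a slice, including its last byte, maps to the same pe_id.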

PathRouter(graph)

On construction, builds four separate adjacency graphs in one pass:

  1. self._adj_all: every edge (used for component-to-component routing).
  2. self._adj: edges with kind != "command" (PE DMA / generic data paths).
  3. self._adj_mcpu_dma: excludes _MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"} (M_CPU DMA must not pass through PE pipeline nodes).
  4. self._adj_local: excludes the 8-element _UCIE_KINDS set (UCIe would look like a zero-distance bus to Dijkstra, which would prefer it over the mesh — for cube-local routing this must be avoided).

Each graph is a defaultdict(list) of (neighbor, weight). The weight is edge.routing_weight_mm or edge.distance_mm.

In short, PathRouter's first act is "classify topology edges into four policy-specific adjacency lists simultaneously". Each find_*() call picks the appropriate graph and runs Dijkstra.
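The one-pass classification described above might look roughly like the following sketch. The Edge dataclass, the build_adjacency function, and the "mesh" edge kind are hypothetical stand-ins; treating edges as directed is also an assumption. The exclusion sets and the weight expression follow this ADR.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Edge:  # hypothetical stand-in for the topology edge type
    src: str
    dst: str
    kind: str
    distance_mm: float
    routing_weight_mm: Optional[float] = None


MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"}
UCIE_KINDS = {
    "ucie_internal", "ucie_conn_to_router", "router_to_ucie_conn",
    "ucie_conn_to_noc", "noc_to_ucie_conn", "ucie_mesh",
    "io_to_cube", "cube_to_io",
}


def build_adjacency(edges):
    """Classify every edge into the four policy graphs in a single pass."""
    adj_all, adj, adj_mcpu_dma, adj_local = (defaultdict(list) for _ in range(4))
    for e in edges:
        # `or` falls back to distance_mm when routing_weight_mm is None or 0.
        entry = (e.dst, e.routing_weight_mm or e.distance_mm)
        adj_all[e.src].append(entry)
        if e.kind != "command":
            adj[e.src].append(entry)
        if e.kind not in MCPU_DMA_EXCLUDE:
            adj_mcpu_dma[e.src].append(entry)
        if e.kind not in UCIE_KINDS:
            adj_local[e.src].append(entry)
    return adj_all, adj, adj_mcpu_dma, adj_local


edges = [
    Edge("r0", "r1", "mesh", 2.0),  # "mesh" is a made-up kind for the example
    Edge("r0", "m_cpu", "command", 1.0),
    Edge("pe0", "r0", "pe_to_router", 0.5),
    Edge("r0", "ucie0", "router_to_ucie_conn", 3.0, routing_weight_mm=10.0),
]
adj_all, adj, adj_mcpu_dma, adj_local = build_adjacency(edges)
print(adj["r0"])        # command edge dropped
print(adj_local["r0"])  # UCIe edge dropped
```

Because each edge is tested against every policy independently, one edge can appear in several graphs, and a find_*() call only has to pick a dictionary.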

Context

policy/routing/router.py combines two responsibilities:

  • Naming: it is the sole owner of the topology naming convention (sip{S}.cube{C}.<comp>, sip{S}.io{I}.pcie_ep, etc.). Components / probe / IPCQ install / runtime API do not build node-id strings themselves — they call helpers.
  • Path decisions: policy separation by edge.kind. For the same src→dst, different routing intents (PE DMA vs M_CPU DMA vs general component routing) call for different adjacencies and so produce different paths.

This helper API is widely consumed (probe.py / distributed.py / install.py / various components / tests), yet the exact signatures / return semantics / adjacency picks are not gathered in any ADR. This ADR closes that gap.

Decision

D1. AddressResolver exposes five public methods

D1.1. resolve(addr: PhysAddr) -> str

Translates a PhysAddr to a destination node id in the topology:

addr.kind == "hbm"             → f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
  where pe_id = addr.hbm_offset // self._hbm_slice_bytes  (ADR-0017 D4/D9)

addr.kind == "pe_resource":
  addr.unit_type == PE         → f"sip{s}.cube{d}.pe{addr.pe_id}.pe_tcm"
  addr.unit_type == SRAM       → f"sip{s}.cube{d}.sram"
  addr.unit_type == MCPU       → f"sip{s}.cube{d}.m_cpu"
  others                       → RoutingError("unsupported unit_type")

other kinds                    → RoutingError("unsupported address kind")

If the derived node id is not in self._node_ids, resolve() raises RoutingError(f"node {node_id} not found in topology"). A syntactically valid address therefore still fails loudly when its node is absent from the topology.
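The mapping above can be sketched as a standalone function. This is an illustration under stated assumptions, not the router.py implementation: the PhysAddr field names sip and cube, the UnitType enum with its OTHER member, and the free-function form of resolve are all hypothetical. The node-id formats and error messages are the ones listed in D1.1.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RoutingError(Exception):
    pass


class UnitType(Enum):  # hypothetical enum; the real one may differ
    PE = "pe"
    SRAM = "sram"
    MCPU = "mcpu"
    OTHER = "other"


@dataclass(frozen=True)
class PhysAddr:  # hypothetical field names
    kind: str
    sip: int
    cube: int
    hbm_offset: int = 0
    unit_type: Optional[UnitType] = None
    pe_id: int = 0


def resolve(addr: PhysAddr, node_ids: set, hbm_slice_bytes: int) -> str:
    prefix = f"sip{addr.sip}.cube{addr.cube}"
    if addr.kind == "hbm":
        node_id = f"{prefix}.hbm_ctrl.pe{addr.hbm_offset // hbm_slice_bytes}"
    elif addr.kind == "pe_resource":
        if addr.unit_type is UnitType.PE:
            node_id = f"{prefix}.pe{addr.pe_id}.pe_tcm"
        elif addr.unit_type is UnitType.SRAM:
            node_id = f"{prefix}.sram"
        elif addr.unit_type is UnitType.MCPU:
            node_id = f"{prefix}.m_cpu"
        else:
            raise RoutingError("unsupported unit_type")
    else:
        raise RoutingError("unsupported address kind")
    # A well-formed address can still name a node the topology lacks.
    if node_id not in node_ids:
        raise RoutingError(f"node {node_id} not found in topology")
    return node_id


nodes = {"sip0.cube1.hbm_ctrl.pe1", "sip0.cube1.pe2.pe_tcm"}
slice_bytes = 6 * (1 << 30)
print(resolve(PhysAddr("hbm", 0, 1, hbm_offset=slice_bytes + 64), nodes, slice_bytes))
```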

D1.2. find_m_cpu(sip, cube) -> str

Returns f"sip{sip}.cube{cube}.m_cpu"; absent → RoutingError.

D1.3. find_pcie_ep(sip, io_id="io0") -> str

Returns f"sip{sip}.{io_id}.pcie_ep"; absent → RoutingError.

D1.4. find_io_cpu(sip, io_id="io0") -> str

Returns f"sip{sip}.{io_id}.io_cpu"; absent → RoutingError.

D1.5. find_all_pcie_eps() -> list[str]

All PCIE_EP node ids across all SIPs, sorted. Filtered by endswith(".pcie_ep"). Cross-SIP IPCQ uses this when enumerating PCIE_EPs.

This class is the sole owner of the naming convention (sip{S}.cube{C}.<comp>, sip{S}.{io_id}.<comp>) — ADR-0015 D4. The topology builder produces nodes with the same naming convention; components never build node-id strings directly — they go through these helpers.
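A minimal sketch of the lookup helpers in D1.2 through D1.5, written as free functions over a set of node ids. The _require helper and the exact error message for a missing node are assumptions (the ADR only states that an absent node raises RoutingError), and the sort is assumed to be a plain string sort.

```python
class RoutingError(Exception):
    pass


def _require(node_ids: set, node_id: str) -> str:
    """Hypothetical helper: return node_id only if the topology has it."""
    if node_id not in node_ids:
        raise RoutingError(f"node {node_id} not found in topology")
    return node_id


def find_m_cpu(node_ids: set, sip: int, cube: int) -> str:
    return _require(node_ids, f"sip{sip}.cube{cube}.m_cpu")


def find_pcie_ep(node_ids: set, sip: int, io_id: str = "io0") -> str:
    return _require(node_ids, f"sip{sip}.{io_id}.pcie_ep")


def find_all_pcie_eps(node_ids: set) -> list:
    # Plain string sort: "sip10..." would order before "sip2...".
    return sorted(n for n in node_ids if n.endswith(".pcie_ep"))


nodes = {
    "sip1.io0.pcie_ep", "sip0.io0.pcie_ep", "sip0.io0.io_cpu",
    "sip0.cube0.m_cpu",
}
print(find_pcie_ep(nodes, 0))
print(find_all_pcie_eps(nodes))
```

Callers pass integers and an io_id and never assemble the dotted string themselves, which is what keeps the naming convention in one place.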

D2. PathRouter's four adjacency graphs

Constructed in one pass. edge.kind drives policy:

| graph | excluded edge kinds | use case |
| --- | --- | --- |
| _adj_all | (none) | M_CPU↔NOC command included, IO_CPU/M_CPU routes |
| _adj | "command" | PE DMA / generic data paths |
| _adj_mcpu_dma | "pe_internal", "pe_to_router" | M_CPU DMA (skips PE pipeline) |
| _adj_local | _UCIE_KINDS (ucie_internal, ucie_conn_to_router, router_to_ucie_conn, ucie_conn_to_noc, noc_to_ucie_conn, ucie_mesh, io_to_cube, cube_to_io) | same-cube routing (UCIe bus excluded) |

Each graph is dict[node_id, list[(neighbor, weight)]] with weight = edge.routing_weight_mm or edge.distance_mm. Excluding command edges prevents them from influencing routing; isolating _adj_local keeps UCIe's "zero-distance bus" from out-competing the mesh — consistent with ADR-0017 D7's cross-PE-slice mesh-distance requirement.

D3. PathRouter exposes five public methods (+ two backward-compat shims)

D3.1. find_path(src_pe: str, dst_node: str) -> list[str]

PE DMA routing. src_pe is a PE prefix (e.g., "sip0.cube0.pe0"); the function auto-prepends .pe_dma, making the true start node "sip0.cube0.pe0.pe_dma".

Adjacency depends on cube-locality (_same_cube):

  • Same-cube (src and dst share sip{S}.cube{C}. prefix): uses _adj_local. Excluding UCIe lets cross-PE-slice access pay accurate mesh distance (ADR-0017 D7).
  • Cross-cube: uses _adj. UCIe naturally becomes the right choice for the cross-cube portion.
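The locality test and graph selection in D3.1 can be sketched as follows. The regular expression, the cube_prefix helper, and pick_adjacency are hypothetical; the ADR names only _same_cube and the two adjacency graphs, so the way the prefix is extracted here is an assumption.

```python
import re

# Matches the "sip{S}.cube{C}." prefix; the trailing dot keeps cube1 != cube10.
_CUBE_PREFIX = re.compile(r"^(sip\d+\.cube\d+)\.")


def cube_prefix(node_id: str):
    m = _CUBE_PREFIX.match(node_id)
    return m.group(1) if m else None


def same_cube(src: str, dst: str) -> bool:
    a = cube_prefix(src)
    return a is not None and a == cube_prefix(dst)


def pick_adjacency(src_pe: str, dst_node: str, adj_local: dict, adj: dict):
    """Return the true start node and the graph Dijkstra should run on."""
    start = f"{src_pe}.pe_dma"  # find_path auto-appends .pe_dma
    graph = adj_local if same_cube(start, dst_node) else adj
    return start, graph


adj_local, adj = {"tag": "local"}, {"tag": "data"}
print(pick_adjacency("sip0.cube0.pe0", "sip0.cube0.hbm_ctrl.pe3", adj_local, adj))
print(pick_adjacency("sip0.cube0.pe0", "sip0.cube1.hbm_ctrl.pe3", adj_local, adj))
```

Nodes outside any cube, such as sip0.io0.pcie_ep, have no cube prefix and so always take the cross-cube branch in this sketch.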

D3.2. find_path_with_distance(src_pe, dst_node) -> tuple[list[str], float]

Same adjacency policy as D3.1, but returns (path, total_distance). Used by probe and analysis tools that need the distance metric.

D3.3. find_mcpu_dma_path(m_cpu_id: str, dst_hbm_id: str) -> list[str]

M_CPU DMA path. Same cube → _adj_local (stay within the mesh); different cube → _adj_all (cross via UCIe). The _MCPU_DMA_EXCLUDE set ensures PE-pipeline nodes never appear on M_CPU's routes.

D3.4. find_memory_path(src: str, dst: str) -> list[str]

Direct memory path like pcie_ep → io_noc → cube → router mesh → hbm_ctrl. Uses _adj_mcpu_dma to exclude pe_internal and pe_to_router, so host-issued reads/writes never leak into the PE pipeline. Probe (ADR-0049 D1's H2D/D2H cases) calls this directly.

D3.5. find_node_path(src: str, dst: str) -> list[str]

Generic routing between arbitrary nodes, including command edges (via _adj_all). IoCpuComponent / MCpuComponent use this when they need to route through M_CPU ↔ NOC command-kind links.

D3.6. Backward-compat shims

  • _dijkstra(start, goal) -> list[str] — thin wrapper for _run_dijkstra(self._adj, …).
  • _dijkstra_with_dist(start, goal) -> tuple[list[str], float] — distance-aware variant.

Despite the underscore prefixes (suggesting internal API), existing tests call these directly. New code should prefer D3.1–D3.5; these two shims are deprecation candidates.

D4. Dijkstra — single-source shortest path

_run_dijkstra_with_dist(adj, start, goal):

  • heapq priority queue.
  • best: dict[node, distance] — best known distance to each node.
  • prev: dict[node, predecessor] — for path reconstruction.
  • Edge weight = routing_weight_mm or distance_mm. The separation matters because UCIe (and a few others) declare an explicit routing_weight_mm distinct from physical distance_mm.

start == goal short-circuits to ([start], 0.0). Unreachable target → RoutingError(f"no path from {start} to {goal}").

The algorithm is deterministic: identical graph + start/goal gives the same path, satisfying SPEC R1 ("routing MUST be deterministic"). Tie-breaks follow heapq's push order (Python list order is deterministic).
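For reference, a compact Dijkstra with the behaviors listed in D4 is sketched below. It is not the router.py source. In this sketch the heap holds (distance, node_id) tuples, so equal-distance entries are ordered by node id; the real implementation's tie-break may differ.

```python
import heapq


class RoutingError(Exception):
    pass


def run_dijkstra_with_dist(adj, start, goal):
    """Return (path, total_distance) over adj: node -> [(neighbor, weight)]."""
    if start == goal:
        return [start], 0.0
    best = {start: 0.0}   # best known distance to each node
    prev = {}             # predecessor on the best known path
    heap = [(0.0, start)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == goal:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry, a shorter route was already found
        for neighbor, weight in adj.get(node, ()):
            cand = dist + weight
            if cand < best.get(neighbor, float("inf")):
                best[neighbor] = cand
                prev[neighbor] = node
                heapq.heappush(heap, (cand, neighbor))
    if goal not in best:
        raise RoutingError(f"no path from {start} to {goal}")
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    path.reverse()
    return path, best[goal]


adj = {"a": [("b", 1.0), ("c", 5.0)], "b": [("c", 1.0)]}
print(run_dijkstra_with_dist(adj, "a", "c"))  # (['a', 'b', 'c'], 2.0)
```

The strict `<` comparison keeps the first predecessor found for a given cost, so repeated runs on the same adjacency lists return the same path.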

D5. Single-owner principle for helper-API decisions

The following decisions live only inside router.py:

  • Naming convention: sip{S}.cube{C}.<comp>, sip{S}.{io_id}.<comp>, sip{S}.cube{C}.hbm_ctrl.pe{pe_id}.
  • Adjacency policy: which edge kinds belong to which graph.
  • Algorithm for recovering PE id from an HBM slice size.
  • Dijkstra weight selection (routing_weight_mm or distance_mm).

Breaking single ownership (e.g., a component starting to build f"sip{s}..." itself) would explode the blast radius of naming-convention changes. This aligns with ADR-0015 D4.

D6. Consumers of the helper API

Methods listed in this ADR are called from (current corpus):

  • probes/probe.py (ADR-0049): find_pcie_ep, find_io_cpu, find_m_cpu, find_node_path, find_mcpu_dma_path, find_memory_path, find_path, resolve.
  • runtime_api/distributed.py (ADR-0047): indirectly (engine-internal routing).
  • ccl/install.py (ADR-0023): find_all_pcie_eps, resolve.
  • sim_engine/event_log.py: like probe — find_pcie_ep, find_memory_path.
  • components/builtin/m_cpu.py, components/builtin/io_cpu.py: find_node_path, find_mcpu_dma_path.
  • Tests (test_routing.py, test_cross_sip_routing.py, …): most of D3.1–D3.5.

When a new consumer arrives, D1/D3 act as a first-pass guide on whether an existing method matches the intent or a new one is needed.

Alternatives Considered

A1. One adjacency graph + per-call edge-kind filtering

Rejected. Re-filtering the graph on every find_*() call hurts Dijkstra cache locality. Constructing four graphs in one pass (D2) has modest memory cost (edges ≤ a few × 10⁴), and selection happens in O(1) at call time.

A2. Drive adjacency separation by separate edge metadata rather than kind

Rejected. edge.kind is already assigned by the topology builder (ADR-0015 D4 + ADR-0017); a parallel metadata field would force synchronization between two systems.

A3. Use BFS with uniform weights instead of Dijkstra

Rejected. With per-edge routing_weight_mm (mesh link / UCIe / IO-internal), BFS minimizes hop count rather than total latency/distance. SPEC R1 + R2 require deterministic and accurate routing, which BFS does not deliver.
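The difference can be seen on a toy graph. In the sketch below, node names and weights are invented for illustration: a single heavy link competes with three light hops. BFS returns the one-hop route, while a weight-aware search returns the lighter three-hop route. Both functions assume the goal is reachable.

```python
import heapq
from collections import deque


def _rebuild(prev, goal):
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]


def bfs_path(adj, start, goal):
    """Fewest hops; edge weights are ignored."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for neighbor, _weight in adj.get(node, ()):
            if neighbor not in prev:
                prev[neighbor] = node
                queue.append(neighbor)
    return _rebuild(prev, goal)


def dijkstra_path(adj, start, goal):
    """Lowest total weight."""
    best, prev, heap = {start: 0.0}, {start: None}, [(0.0, start)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == goal:
            break
        if dist > best[node]:
            continue
        for neighbor, weight in adj.get(node, ()):
            cand = dist + weight
            if cand < best.get(neighbor, float("inf")):
                best[neighbor], prev[neighbor] = cand, node
                heapq.heappush(heap, (cand, neighbor))
    return _rebuild(prev, goal), best[goal]


# One heavy direct link (weight 10) versus three light hops (weight 1 each).
adj = {"a": [("d", 10.0), ("b", 1.0)], "b": [("c", 1.0)], "c": [("d", 1.0)]}
print(bfs_path(adj, "a", "d"))       # ['a', 'd']
print(dijkstra_path(adj, "a", "d"))  # (['a', 'b', 'c', 'd'], 3.0)
```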

A4. Express the helper API as module functions instead of classes

Rejected. Each class (AddressResolver, PathRouter) maintains caches (_node_ids, _hbm_slice_bytes, four adjacency graphs) reused across many routing queries on the same graph. Module functions would have to rebuild state per call or go global, hurting safety and performance.

Consequences

  • When components / probe / IPCQ install / runtime API all go through router.py helpers, a naming-convention change (e.g., .io0. → .iochiplet0.) is a one-file edit (D5).
  • D2's four-graph split is now ADR-locked, so when a new edge kind is added (e.g., a new inter-die UCIe-link kind), the right adjacency category is decided explicitly rather than by default.
  • D3.1's same-cube vs cross-cube branching (ADR-0017 D7) is explicit, so anyone changing routing knows which adjacency to touch.
  • D6's consumer list bounds PR-review scope for helper-API changes, and the backward-compat shims (D3.6) are flagged as deprecation candidates.