kernbench2/docs/adr/ADR-0051-lat-routing-helper-api.md
ywkang bd49c93703 adr: add ADR-0050-0053 — close /report's second-pass G4 candidates
Documents four cross-cutting surfaces one layer deeper than the prior
G4 batch:

- 0050 par-ccl-algorithm-module-contract: how to author a new CCL
  algorithm in src/kernbench/ccl/algorithms/. Pairs with ADR-0045's
  bench-module contract. Pins the four required public symbols
  (kernel, kernel_args, TOPO_NAME_TO_KIND constants, kernel alias),
  the 9 + tl standardized kernel signature, the kernel_args tuple
  format, sip_topo_kind dispatch, and the ccl.yaml entry workflow.

- 0051 lat-routing-helper-api: every public method of AddressResolver
  (resolve, find_m_cpu, find_pcie_ep, find_io_cpu, find_all_pcie_eps)
  and PathRouter (find_path, find_path_with_distance,
  find_mcpu_dma_path, find_memory_path, find_node_path + 2 shims).
  Pins the four adjacency graphs (_adj_all / _adj / _adj_mcpu_dma /
  _adj_local) and the edge-kind exclusion sets they use, plus the
  single-owner naming convention.

- 0052 dev-oplog-memory-store-schemas: OpRecord's 7 fields, the
  per-op_name params matrix (dma_read, dma_write, gemm_*, math, math
  reduction, composite_gemm, ipcq_copy, unknown), snapshot timing
  rules (math = all inputs, dma_write = HBM-only — ADR-0027 race
  avoidance), TileToken stage_type capture, and MemoryStore's
  (space, addr) two-level dict with reference-store semantics.

- 0053 dev-topology-builder-algorithms: the 6-stage compile pipeline,
  cube_mesh.yaml's source_hash cache and its 5 input fields, the
  cube NoC auto-layout algorithm (row/col placement, HBM exclusion
  zone, PE/M_CPU/SRAM attachment via nearest-router, UCIe N/S/E/W
  distribution), the node naming convention (single-owner with
  router.py), the edge-kind catalog, the 4 view projections, and a
  table of spec-field changes vs mesh regeneration.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:52:42 -07:00

# ADR-0051: Routing Helper API — `AddressResolver` + `PathRouter`
## Status
Accepted (2026-05-22).
Pins down every public API, argument, return value, and adjacency-graph
selection of the two helper classes (`AddressResolver`, `PathRouter`)
exposed by `policy/routing/router.py`. ADR-0002 defines routing
distance, ordering, and bypass rules, but **the helper API surface
itself** has had no ADR-level coverage.
## First action
### `AddressResolver(graph)`
On construction, caches two pieces of state:
1. `self._node_ids = set(graph.nodes)` — a set of all node ids for
lookup.
2. `self._hbm_slice_bytes = hbm_total_gb * (1 << 30) // slices_per_cube`
— derived from `graph.spec.cube.memory_map` (default `48 GB / 8
slices = 6 GB`). `resolve()` uses this value to decode `pe_id` from
an HBM PA's `hbm_offset`.
In short, **AddressResolver's first act is "precompute the full set of
node ids and the HBM slice size"**. It does not retain the graph
itself.
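The slice-size arithmetic can be checked with a short sketch. The constants mirror the defaults quoted above (48 GB, 8 slices); the function names are hypothetical and are not taken from `router.py`.

```python
# Illustrative only: function names are hypothetical, constants are the
# defaults quoted in the text (48 GB of HBM per cube, 8 slices per cube).
HBM_TOTAL_GB = 48
SLICES_PER_CUBE = 8


def hbm_slice_bytes(hbm_total_gb: int, slices_per_cube: int) -> int:
    """Bytes per HBM slice, using the same integer arithmetic as the text."""
    return hbm_total_gb * (1 << 30) // slices_per_cube


def decode_pe_id(hbm_offset: int, slice_bytes: int) -> int:
    """Index of the slice (and thus the PE) that owns hbm_offset."""
    return hbm_offset // slice_bytes


SLICE_BYTES = hbm_slice_bytes(HBM_TOTAL_GB, SLICES_PER_CUBE)  # 6 * 2**30
```

With these defaults, any offset in the first 6 GiB decodes to `pe_id` 0 and the byte at exactly 6 GiB decodes to `pe_id` 1.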
### `PathRouter(graph)`
On construction, **builds four separate adjacency graphs in one pass**:
1. `self._adj_all`: every edge (used for component-to-component
routing).
2. `self._adj`: edges with `kind != "command"` (PE DMA / generic data
paths).
3. `self._adj_mcpu_dma`: excludes
`_MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"}` (M_CPU DMA
must not pass through PE pipeline nodes).
4. `self._adj_local`: excludes the 8-element `_UCIE_KINDS` set (UCIe
would look like a zero-distance bus to Dijkstra, which would prefer
it over the mesh — for cube-local routing this must be avoided).
Each graph is a `defaultdict(list)` of `(neighbor, weight)`. The
weight is `edge.routing_weight_mm or edge.distance_mm`.
In short, **PathRouter's first act is "classify topology edges into
four policy-specific adjacency lists simultaneously"**. Each `find_*()`
call picks the appropriate graph and runs Dijkstra.
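A minimal sketch of the one-pass classification, assuming directed edges with `src`, `dst`, `kind`, `distance_mm`, and an optional `routing_weight_mm`. The `Edge` dataclass and `build_adjacency` function are hypothetical stand-ins; only the exclusion sets and the weight rule come from this ADR (the UCIe kind names are the ones listed in D2).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Edge:
    # Hypothetical stand-in for the topology edge type.
    src: str
    dst: str
    kind: str
    distance_mm: float
    routing_weight_mm: Optional[float] = None


MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"}
UCIE_KINDS = {
    "ucie_internal", "ucie_conn_to_router", "router_to_ucie_conn",
    "ucie_conn_to_noc", "noc_to_ucie_conn", "ucie_mesh",
    "io_to_cube", "cube_to_io",
}


def build_adjacency(edges):
    """Classify every edge into the four adjacency lists in a single pass."""
    adj_all = defaultdict(list)
    adj = defaultdict(list)
    adj_mcpu_dma = defaultdict(list)
    adj_local = defaultdict(list)
    for e in edges:
        # Python's `or` also falls back to distance_mm when
        # routing_weight_mm is 0.0, not only when it is None.
        weight = e.routing_weight_mm or e.distance_mm
        entry = (e.dst, weight)
        adj_all[e.src].append(entry)
        if e.kind != "command":
            adj[e.src].append(entry)
        if e.kind not in MCPU_DMA_EXCLUDE:
            adj_mcpu_dma[e.src].append(entry)
        if e.kind not in UCIE_KINDS:
            adj_local[e.src].append(entry)
    return adj_all, adj, adj_mcpu_dma, adj_local
```

Each edge is visited once and appended to every graph whose exclusion set does not contain its kind, so choosing a policy at query time is just choosing which dictionary to hand to Dijkstra.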
## Context
`policy/routing/router.py` combines two responsibilities:
- **Naming**: it is the sole owner of the topology naming convention
(`sip{S}.cube{C}.<comp>`, `sip{S}.io{I}.pcie_ep`, etc.). Components /
probe / IPCQ install / runtime API do not build node-id strings
themselves — they call helpers.
- **Path decisions**: policy separation by `edge.kind`. For the same
src→dst, different routing intents (PE DMA vs M_CPU DMA vs general
component routing) call for different adjacencies and so produce
different paths.
This helper API is widely consumed (probe.py / distributed.py /
install.py / various components / tests), yet **the exact signatures /
return semantics / adjacency picks** are not gathered in any ADR. This
ADR closes that gap.
## Decision
### D1. `AddressResolver` exposes five public methods
#### D1.1. `resolve(addr: PhysAddr) -> str`
Translates a `PhysAddr` to a destination node id in the topology:
```
addr.kind == "hbm"              → f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
    where pe_id = addr.hbm_offset // self._hbm_slice_bytes   (ADR-0017 D4/D9)
addr.kind == "pe_resource":
    addr.unit_type == PE        → f"sip{s}.cube{d}.pe{addr.pe_id}.pe_tcm"
    addr.unit_type == SRAM      → f"sip{s}.cube{d}.sram"
    addr.unit_type == MCPU      → f"sip{s}.cube{d}.m_cpu"
    others                      → RoutingError("unsupported unit_type")
other kinds                     → RoutingError("unsupported address kind")
```
If the derived node id is not in `self._node_ids`, `resolve()` raises
`RoutingError(f"node {node_id} not found in topology")`. An address
with valid syntax therefore still fails loudly when the node it maps
to is absent from the topology.
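The dispatch above can be sketched as a standalone function. This is an illustration under assumptions, not the code in `router.py`: the `PhysAddr` field names (`sip`, `cube`) and the string-valued `unit_type` are hypothetical, and `RoutingError` is redefined locally so the block runs on its own.

```python
from dataclasses import dataclass
from typing import Optional, Set


class RoutingError(Exception):
    """Local stand-in for the project's RoutingError."""


@dataclass(frozen=True)
class PhysAddr:
    # Hypothetical field layout; the real PhysAddr may differ.
    kind: str
    sip: int
    cube: int
    hbm_offset: int = 0
    unit_type: Optional[str] = None
    pe_id: int = 0


def resolve(addr: PhysAddr, node_ids: Set[str], hbm_slice_bytes: int) -> str:
    """Map an address to a node id, then verify the node exists."""
    prefix = f"sip{addr.sip}.cube{addr.cube}"
    if addr.kind == "hbm":
        pe_id = addr.hbm_offset // hbm_slice_bytes
        node_id = f"{prefix}.hbm_ctrl.pe{pe_id}"
    elif addr.kind == "pe_resource":
        if addr.unit_type == "PE":
            node_id = f"{prefix}.pe{addr.pe_id}.pe_tcm"
        elif addr.unit_type == "SRAM":
            node_id = f"{prefix}.sram"
        elif addr.unit_type == "MCPU":
            node_id = f"{prefix}.m_cpu"
        else:
            raise RoutingError("unsupported unit_type")
    else:
        raise RoutingError("unsupported address kind")
    # Syntactically valid addresses still fail if the node is absent.
    if node_id not in node_ids:
        raise RoutingError(f"node {node_id} not found in topology")
    return node_id
```

The membership check runs last, so all three failure modes (unknown kind, unknown unit type, missing node) surface as the same exception type.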
#### D1.2. `find_m_cpu(sip, cube) -> str`
Returns `f"sip{sip}.cube{cube}.m_cpu"`; absent → `RoutingError`.
#### D1.3. `find_pcie_ep(sip, io_id="io0") -> str`
Returns `f"sip{sip}.{io_id}.pcie_ep"`; absent → `RoutingError`.
#### D1.4. `find_io_cpu(sip, io_id="io0") -> str`
Returns `f"sip{sip}.{io_id}.io_cpu"`; absent → `RoutingError`.
#### D1.5. `find_all_pcie_eps() -> list[str]`
All PCIE_EP node ids across all SIPs, sorted. Filtered by
`endswith(".pcie_ep")`. Cross-SIP IPCQ uses this when enumerating
PCIE_EPs.
This class is the sole owner of the naming convention
(`sip{S}.cube{C}.<comp>`, `sip{S}.{io_id}.<comp>`) — ADR-0015 D4.
The topology builder produces nodes with the same naming convention;
components never build node-id strings directly — they go through
these helpers.
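A minimal sketch of the four lookup helpers (D1.2 through D1.5), assuming they share a single existence check. The class name `NamingHelpers`, the `_require` helper, and the exact error message are hypothetical; the id formats and the `endswith(".pcie_ep")` filter are the ones stated above.

```python
from typing import Iterable, List


class RoutingError(Exception):
    """Local stand-in for the project's RoutingError."""


class NamingHelpers:
    """Hypothetical container for the D1.2 through D1.5 lookups."""

    def __init__(self, node_ids: Iterable[str]) -> None:
        self._node_ids = set(node_ids)

    def _require(self, node_id: str) -> str:
        # Assumed shared check: build the id, then confirm it exists.
        if node_id not in self._node_ids:
            raise RoutingError(f"node {node_id} not found in topology")
        return node_id

    def find_m_cpu(self, sip: int, cube: int) -> str:
        return self._require(f"sip{sip}.cube{cube}.m_cpu")

    def find_pcie_ep(self, sip: int, io_id: str = "io0") -> str:
        return self._require(f"sip{sip}.{io_id}.pcie_ep")

    def find_io_cpu(self, sip: int, io_id: str = "io0") -> str:
        return self._require(f"sip{sip}.{io_id}.io_cpu")

    def find_all_pcie_eps(self) -> List[str]:
        # Sorted so that callers enumerate endpoints in a stable order.
        return sorted(n for n in self._node_ids if n.endswith(".pcie_ep"))
```

A caller would write `helpers.find_pcie_ep(1)` instead of formatting `"sip1.io0.pcie_ep"` by hand, which keeps the naming convention in one place.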
### D2. `PathRouter`'s four adjacency graphs
Constructed in one pass. `edge.kind` drives policy:
| graph | excluded edge kinds | use case |
|-------------------|--------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|
| `_adj_all` | (none) | M_CPU↔NOC command included, IO_CPU/M_CPU routes |
| `_adj` | `"command"` | PE DMA / generic data paths |
| `_adj_mcpu_dma` | `"pe_internal"`, `"pe_to_router"` | M_CPU DMA (skips PE pipeline) |
| `_adj_local` | `_UCIE_KINDS` (`ucie_internal`, `ucie_conn_to_router`, `router_to_ucie_conn`, `ucie_conn_to_noc`, `noc_to_ucie_conn`, `ucie_mesh`, `io_to_cube`, `cube_to_io`) | same-cube routing (UCIe bus excluded) |
Each graph is `dict[node_id, list[(neighbor, weight)]]` with weight =
`edge.routing_weight_mm or edge.distance_mm`. Excluding command edges
prevents them from influencing routing; isolating `_adj_local` keeps
UCIe's "zero-distance bus" from out-competing the mesh — consistent
with ADR-0017 D7's cross-PE-slice mesh-distance requirement.
### D3. `PathRouter` exposes five public methods (+ two backward-compat shims)
#### D3.1. `find_path(src_pe: str, dst_node: str) -> list[str]`
**PE DMA routing**. `src_pe` is a PE prefix (e.g.,
`"sip0.cube0.pe0"`); the function auto-prepends `.pe_dma`, making the
true start node `"sip0.cube0.pe0.pe_dma"`.
Adjacency depends on cube-locality (`_same_cube`):
- **Same-cube** (src and dst share `sip{S}.cube{C}.` prefix): uses
`_adj_local`. Excluding UCIe lets cross-PE-slice access pay accurate
mesh distance (ADR-0017 D7).
- **Cross-cube**: uses `_adj`. UCIe naturally becomes the right choice
for the cross-cube portion.
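The same-cube test and adjacency pick can be sketched as follows. This is an assumption about how `_same_cube` might work based on the prefix rule above; `cube_prefix`, `same_cube`, and `pick_adjacency` are hypothetical names.

```python
import re
from typing import Optional, Tuple

# Matches the "sip{S}.cube{C}." prefix described above. The trailing dot
# keeps "cube1" from being confused with "cube10".
_CUBE_PREFIX = re.compile(r"^(sip\d+\.cube\d+)\.")


def cube_prefix(node_id: str) -> Optional[str]:
    """Return "sip{S}.cube{C}" for cube-resident nodes, else None."""
    match = _CUBE_PREFIX.match(node_id)
    return match.group(1) if match else None


def same_cube(a: str, b: str) -> bool:
    """True only when both nodes live in the same sip/cube pair."""
    prefix_a, prefix_b = cube_prefix(a), cube_prefix(b)
    return prefix_a is not None and prefix_a == prefix_b


def pick_adjacency(src_pe: str, dst_node: str, adj_local, adj) -> Tuple[str, object]:
    """Return the true start node and the graph to search."""
    start = f"{src_pe}.pe_dma"  # find_path auto-appends the DMA node
    graph = adj_local if same_cube(start, dst_node) else adj
    return start, graph
```

For example, `sip0.cube0.pe0` routing to `sip0.cube0.hbm_ctrl.pe3` selects the UCIe-free graph, while the same source routing to `sip0.cube1.hbm_ctrl.pe3` selects the general data graph.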
#### D3.2. `find_path_with_distance(src_pe, dst_node) -> tuple[list[str], float]`
Same adjacency policy as D3.1, but returns `(path, total_distance)`.
Used by probe and analysis tools that need the distance metric.
#### D3.3. `find_mcpu_dma_path(m_cpu_id: str, dst_hbm_id: str) -> list[str]`
**M_CPU DMA path**. Same cube → `_adj_local` (stay within the mesh);
different cube → `_adj_all` (cross via UCIe). The
`_MCPU_DMA_EXCLUDE` set ensures PE-pipeline nodes never appear on
M_CPU's routes.
#### D3.4. `find_memory_path(src: str, dst: str) -> list[str]`
Direct memory path like
`pcie_ep → io_noc → cube → router mesh → hbm_ctrl`. Uses
`_adj_mcpu_dma` to exclude `pe_internal` and `pe_to_router`, so
host-issued reads/writes never leak into the PE pipeline. Probe
(ADR-0049 D1's H2D/D2H cases) calls this directly.
#### D3.5. `find_node_path(src: str, dst: str) -> list[str]`
Generic routing between arbitrary nodes, **including command edges**
(via `_adj_all`). IoCpuComponent / MCpuComponent use this when they
need to route through M_CPU ↔ NOC command-kind links.
#### D3.6. Backward-compat shims
- `_dijkstra(start, goal) -> list[str]`: thin wrapper for
`_run_dijkstra(self._adj, …)`.
- `_dijkstra_with_dist(start, goal) -> tuple[list[str], float]`:
distance-aware variant.
Despite the underscore prefixes (suggesting internal API), existing
tests call these directly. New code should prefer D3.1 through D3.5;
these two shims are deprecation candidates.
### D4. Dijkstra — single-source shortest path
`_run_dijkstra_with_dist(adj, start, goal)`:
- `heapq` priority queue.
- `best: dict[node, distance]` — best known distance to each node.
- `prev: dict[node, predecessor]` — for path reconstruction.
- Edge weight = `routing_weight_mm or distance_mm`. The separation
matters because UCIe (and a few others) declare an explicit
`routing_weight_mm` distinct from physical `distance_mm`.
`start == goal` short-circuits to `([start], 0.0)`. An unreachable
target raises `RoutingError(f"no path from {start} to {goal}")`.
The algorithm is **deterministic**: identical graph + start/goal gives
the same path, satisfying SPEC R1 ("routing MUST be deterministic").
Tie-breaks follow `heapq`'s push order (Python list order is
deterministic).
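A minimal sketch of a Dijkstra search with the behaviour described in D4 (`heapq` queue, `best` and `prev` maps, short-circuit, `RoutingError` on an unreachable goal). It is not the code in `router.py`; in particular, this sketch pushes `(distance, node_id)` tuples, so equal distances are ordered by node id, which is an assumption of the example.

```python
import heapq
from typing import Dict, List, Tuple

Adjacency = Dict[str, List[Tuple[str, float]]]


class RoutingError(Exception):
    """Local stand-in for the project's RoutingError."""


def run_dijkstra_with_dist(adj: Adjacency, start: str, goal: str) -> Tuple[List[str], float]:
    """Single-source shortest path returning (path, total_distance)."""
    if start == goal:
        return [start], 0.0
    best = {start: 0.0}          # best known distance to each node
    prev: Dict[str, str] = {}    # predecessor map for path reconstruction
    heap = [(0.0, start)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == goal:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for neighbor, weight in adj.get(node, ()):
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if goal not in best:
        raise RoutingError(f"no path from {start} to {goal}")
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    path.reverse()
    return path, best[goal]
```

Using `adj.get(node, ())` rather than `adj[node]` avoids inserting empty entries when the adjacency map is a `defaultdict`.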
### D5. Single-owner principle for helper-API decisions
The following decisions live only inside router.py:
- Naming convention: `sip{S}.cube{C}.<comp>`,
`sip{S}.{io_id}.<comp>`,
`sip{S}.cube{C}.hbm_ctrl.pe{pe_id}`.
- Adjacency policy: which edge kinds belong to which graph.
- Algorithm for recovering the PE id from an HBM offset via the HBM slice size.
- Dijkstra weight selection
(`routing_weight_mm or distance_mm`).
Breaking single ownership (e.g., a component starting to build
`f"sip{s}..."` itself) would explode the blast radius of naming-
convention changes. This aligns with ADR-0015 D4.
### D6. Consumers of the helper API
Methods listed in this ADR are called from (current corpus):
- `probes/probe.py` (ADR-0049): `find_pcie_ep`, `find_io_cpu`,
`find_m_cpu`, `find_node_path`, `find_mcpu_dma_path`,
`find_memory_path`, `find_path`, `resolve`.
- `runtime_api/distributed.py` (ADR-0047): indirectly (engine-internal
routing).
- `ccl/install.py` (ADR-0023): `find_all_pcie_eps`, `resolve`.
- `sim_engine/event_log.py`: like probe — `find_pcie_ep`,
`find_memory_path`.
- `components/builtin/m_cpu.py`, `components/builtin/io_cpu.py`:
`find_node_path`, `find_mcpu_dma_path`.
- Tests (test_routing.py, test_cross_sip_routing.py, …): most of
D3.1 through D3.5.
When a new consumer arrives, D1/D3 act as a first-pass guide on
whether an existing method matches the intent or a new one is needed.
## Alternatives Considered
### A1. One adjacency graph + per-call edge-kind filtering
Rejected. Re-filtering the graph on every `find_*()` call hurts
Dijkstra cache locality. Constructing four graphs in one pass (D2)
has modest memory cost (edges ≤ a few × 10⁴), and selection happens
in O(1) at call time.
### A2. Drive adjacency separation by separate edge metadata rather than `kind`
Rejected. `edge.kind` is already assigned by the topology builder
(ADR-0015 D4 + ADR-0017); a parallel metadata field would force
synchronization between two systems.
### A3. Use BFS with uniform weights instead of Dijkstra
Rejected. With per-edge `routing_weight_mm` (mesh link / UCIe /
IO-internal), BFS minimizes hop count rather than total
latency/distance. SPEC R1 + R2 require deterministic and accurate
routing, which BFS does not deliver.
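The difference can be seen on a toy graph. The node names and weights below are invented for illustration and do not correspond to any real topology; the graph has a two-hop route with heavy edges and a three-hop route with light edges.

```python
import heapq
from collections import deque

# Invented toy graph: node -> list of (neighbor, weight).
ADJ = {
    "src": [("bus", 4.0), ("r0", 1.0)],
    "bus": [("dst", 4.0)],
    "r0": [("r1", 1.0)],
    "r1": [("dst", 1.0)],
    "dst": [],
}


def _walk_back(prev, goal):
    """Rebuild a path from a predecessor map (goal must be reachable)."""
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]


def bfs_path(adj, start, goal):
    """Fewest-hops path; edge weights are ignored."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for neighbor, _weight in adj[node]:
            if neighbor not in prev:
                prev[neighbor] = node
                queue.append(neighbor)
    return _walk_back(prev, goal)


def dijkstra_path(adj, start, goal):
    """Minimum-total-weight path."""
    best, prev = {start: 0.0}, {start: None}
    heap = [(0.0, start)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == goal:
            break
        if dist > best[node]:
            continue  # stale queue entry
        for neighbor, weight in adj[node]:
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor], prev[neighbor] = candidate, node
                heapq.heappush(heap, (candidate, neighbor))
    return _walk_back(prev, goal)


def path_cost(adj, path):
    """Sum of edge weights along a path."""
    return sum(dict(adj[a])[b] for a, b in zip(path, path[1:]))
```

On this graph BFS returns `src, bus, dst` at a total weight of 8.0, while Dijkstra returns `src, r0, r1, dst` at 3.0: fewer hops, but more than twice the distance.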
### A4. Express the helper API as module functions instead of classes
Rejected. Each class
(`AddressResolver`, `PathRouter`) maintains caches
(`_node_ids`, `_hbm_slice_bytes`, four adjacency graphs) reused across
many routing queries on the same graph. Module functions would have
to rebuild state per call or go global, hurting safety and
performance.
## Consequences
- When components / probe / IPCQ install / runtime API all go through
router.py helpers, a naming-convention change (e.g., `.io0.` →
`.iochiplet0.`) is a one-file edit (D5).
- D2's four-graph split is now ADR-locked, so when a new edge kind is
added (e.g., a new inter-die UCIe-link kind), the right adjacency
category is decided explicitly rather than by default.
- D3.1's same-cube vs cross-cube branching (ADR-0017 D7) is explicit,
so anyone changing routing knows which adjacency to touch.
- D6's consumer list bounds PR-review scope for helper-API changes,
and the backward-compat shims (D3.6) are flagged as deprecation
candidates.