# ADR-0053: Topology Builder + Visualizer Algorithms
## Status
Accepted (2026-05-22).
Pins down the key algorithmic choices of the topology compile and
visualization pipeline jointly implemented by `topology/builder.py`,
`topology/mesh_gen.py`, and `topology/visualizer.py`:
placement-driven router attachment, mesh auto-layout, the source_hash
cache, view projections, and SVG rendering. ADR-0006 defines the
high-level intent of topology compilation (compiled topology, distance
extraction, automatic diagram generation), but **which algorithms the
builder actually uses** was only discoverable via source grep.
## First action
When `resolve_topology(path_str)` is called, four steps run in order:
1. **Path validation** (`builder.py::resolve_topology`):
`Path(path_str).expanduser().resolve()`, existence check, file
check. Failure → `FileNotFoundError` or `ValueError`.
2. **YAML parsing** (`_read_spec`): `yaml.safe_load`. Parse errors
yield a `ValueError` with line/column. Non-dict roots are
rejected.
3. **Auto-generate the mesh** (`mesh_gen.ensure_mesh_file`): create or
reuse a `cube_mesh.yaml` next to the topology file. Cache hit on
matching source_hash; miss triggers regeneration. This step decides
the cube NoC's router grid and attachment information.
4. **Compile the graph** (`_compile_graph`): system → IO chiplets →
cubes → inter-cube edges → IO↔cube edges → system↔IO edges, then
build four view projections (system, sip, cube, pe) and wrap into
a `TopologyGraph`.
In short, **topology compilation's first act is "read topology.yaml as
a dict, create/validate cube_mesh.yaml in the same directory, then
build the flat graph + 4-view projection in system → sip → cube → pe
order"**.
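
Steps 1 and 3 above can be sketched with the standard library alone. This is a minimal illustration, not the contents of `builder.py`: the function names `validate_topology_path` and `mesh_path_for` are hypothetical, and the mapping of each failure to `FileNotFoundError` versus `ValueError` is an assumption.

```python
from pathlib import Path


def validate_topology_path(path_str: str) -> Path:
    """Hypothetical stand-in for the path-validation step."""
    path = Path(path_str).expanduser().resolve()
    if not path.exists():
        # Assumption: a missing path maps to FileNotFoundError.
        raise FileNotFoundError(f"topology file not found: {path}")
    if not path.is_file():
        # Assumption: an existing non-file (e.g. a directory) maps to ValueError.
        raise ValueError(f"topology path is not a file: {path}")
    return path


def mesh_path_for(topology_path: Path) -> Path:
    """cube_mesh.yaml is created or reused next to the topology file."""
    return topology_path.with_name("cube_mesh.yaml")
```

Resolving the path first means the mesh file location is derived from an absolute directory, independent of the caller's working directory.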
## Context
`topology/` package responsibilities:
- **builder.py** (1207 lines): turns topology.yaml into a
`TopologyGraph` (nodes + edges + 4 view projections).
- **mesh_gen.py** (305 lines): auto-decides the cube NoC's router
grid and PE/UCIe/M_CPU/SRAM attachment positions and caches them in
`cube_mesh.yaml`.
- **visualizer.py** (887 lines): generates four SVG diagrams (system /
sip / cube / pe) from a `TopologyGraph`.
ADR-0006 makes the high-level decision that "the result of topology
compilation is the single source for distance metadata and diagram
generation", but specific algorithms (e.g., placement-driven nearest-
router attachment, the HBM exclusion zone, which fields in source_hash
trigger regeneration) are not in any ADR.
In particular, these decisions are absent at ADR level:
- Why is mesh_gen cached in a separate file (`cube_mesh.yaml`)?
- Which fields are in source_hash, and which changes force
regeneration?
- Why are placement coordinates in mm rather than cube coordinates?
- How are the HBM exclusion zone and UCIe N/S/E/W distribution
decided inside the mesh?
- What is the abstraction-level difference among the four view
projections (system/sip/cube/pe)?
This ADR captures these decisions in one place.
## Decision
### D1. Compile pipeline — six stages
`_compile_graph(spec)`:
1. **System nodes** (`_instantiate_system`): add system-level nodes
like `fabric.switch0` and the host CPU.
2. **Per-SIP loop** (`for sip_id in range(system.sips.count)`):
- **IO chiplets** (`_instantiate_io_chiplets`): create pcie_ep /
io_cpu / io_noc / io_ucie PHYs / conn nodes and their bidirectional
internal edges.
- **Cube instantiation** (`_instantiate_cube`): using
cube_mesh.yaml's router grid, instantiate cube routers, PE
sub-components (pe_cpu, pe_dma, pe_fetch_store, pe_gemm, pe_math,
pe_mmu, pe_tcm, pe_scheduler, pe_ipcq), m_cpu, sram, hbm_ctrl,
and their internal edges.
- **Inter-cube edges** (`_add_inter_cube_edges`): the UCIe
N/S/E/W mesh edges.
- **IO ↔ cube edges** (`_add_io_to_cube_edges`): connect io_noc to
each cube's edge UCIe phy.
3. **Switch ↔ IO edges** (`_add_system_to_io_edges`): bidirectional
edges between `fabric.switch0` and each SIP's `pcie_ep` (the
cross-SIP IPCQ path of ADR-0038 D3 + ADR-0010).
4. **Build four view projections**:
- `_build_system_view(spec)` — Tray level: SIPs and the system
switch.
- `_build_sip_view(spec)` — inside one SIP: cube mesh + IO
chiplet.
- `_build_cube_view(spec)` — inside one cube: router grid + PE /
M_CPU / SRAM / HBM_CTRL attachments.
- `_build_pe_view(spec)` — inside one PE: nine sub-components +
internal edges (pe_internal kind).
5. **Return `TopologyGraph`**: `TopologyGraph(spec, nodes, edges,
system_view, sip_view, cube_view, pe_view)`.
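The six instantiation stages (system nodes, IO chiplets, cubes,
inter-cube edges, IO ↔ cube edges, switch ↔ IO edges) are **ordered
for a reason**: only after cubes exist do inter-cube edges have valid
src/dst, and IO chiplets must precede the IO ↔ cube edges that
reference them. View projection and the `TopologyGraph` wrap-up run
after all six. New node types must slot in at the right spot.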
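
The ordering constraint can be illustrated with a toy graph whose `add_edge` rejects endpoints that do not exist yet. This is a sketch, not the `_compile_graph` implementation: `GraphSketch` and `compile_sketch` are hypothetical, the `io0` chiplet id is made up, and the UCIe and IO attachments are simplified to router nodes.

```python
class GraphSketch:
    """Toy flat graph; edges may only reference already-created nodes."""

    def __init__(self):
        self.nodes = set()
        self.edges = []

    def add_node(self, name):
        self.nodes.add(name)

    def add_edge(self, src, dst, kind):
        for end in (src, dst):
            if end not in self.nodes:
                raise KeyError(f"edge {src} -> {dst}: missing node {end}")
        self.edges.append((src, dst, kind))


def compile_sketch():
    g = GraphSketch()
    # Stage 1: system nodes.
    g.add_node("fabric.switch0")
    # Stage 2: IO chiplet nodes and their internal edges.
    g.add_node("sip0.io0.pcie_ep")
    g.add_node("sip0.io0.io_noc")
    g.add_edge("sip0.io0.pcie_ep", "sip0.io0.io_noc", "io_internal")
    # Stage 3: cubes (one router per cube in this toy).
    for cube in (0, 1):
        g.add_node(f"sip0.cube{cube}.noc.r0c0")
    # Stage 4: inter-cube edges, valid only because both cubes exist.
    g.add_edge("sip0.cube0.noc.r0c0", "sip0.cube1.noc.r0c0", "ucie_mesh")
    # Stage 5: IO <-> cube edges, valid only because IO and cubes exist.
    g.add_edge("sip0.io0.io_noc", "sip0.cube0.noc.r0c0", "io_to_cube")
    # Stage 6: switch <-> IO edges.
    g.add_edge("fabric.switch0", "sip0.io0.pcie_ep", "pcie")
    return g
```

Moving stage 4 ahead of stage 3 in this toy raises `KeyError`, which is the failure mode the stage order is meant to rule out.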
### D2. `cube_mesh.yaml` — a separate file with a source_hash cache
`mesh_gen.ensure_mesh_file(cube_spec, mesh_path)`:
1. Compute `source_hash = _compute_source_hash(cube_spec)` from these
input fields:
- `geometry` (cube_mm.w/h …).
- `pe_layout` (corners, pe_per_corner).
- `ucie.n_connections`.
- `memory_map.hbm_mapping_mode`.
- `placement` (m_cpu/sram pos_mm).
2. If `mesh_path` (= `cube_mesh.yaml` next to topology.yaml) exists
and `existing.source_hash == source_hash`, reuse it (cache hit).
3. Otherwise, generate a new mesh via
`_generate_mesh(cube_spec, source_hash)` and write to yaml.
The mesh is cached in a separate file because:
- Mesh generation involves nontrivial PE/UCIe/router attachment math
and is too expensive to redo every time.
- Multiple runs with the same cube spec must guarantee an identical
mesh.
- The resulting mesh is itself an inspectable / debuggable artifact.
The five fields listed in source_hash are the ones that determine
mesh shape; other changes (e.g., bandwidth, overhead_ns) do not
trigger mesh regeneration.
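
The cache rule can be sketched as a hash over only the five listed fields. The hashing scheme here (canonical JSON plus SHA-256) is an assumption chosen for illustration; the source does not say how `_compute_source_hash` serializes its inputs. The names `compute_source_hash_sketch` and `needs_regeneration` are hypothetical.

```python
import hashlib
import json

# The five inputs named in D2, as dotted paths into the cube spec.
HASHED_FIELDS = (
    "geometry",
    "pe_layout",
    "ucie.n_connections",
    "memory_map.hbm_mapping_mode",
    "placement",
)


def _get(spec, dotted):
    """Walk a dotted path; return None if any level is missing."""
    cur = spec
    for key in dotted.split("."):
        if not isinstance(cur, dict) or key not in cur:
            return None
        cur = cur[key]
    return cur


def compute_source_hash_sketch(cube_spec):
    # Assumption: canonical JSON + SHA-256; the real scheme may differ.
    picked = {field: _get(cube_spec, field) for field in HASHED_FIELDS}
    blob = json.dumps(picked, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def needs_regeneration(cube_spec, existing_mesh):
    """Cache miss when no mesh exists or its stored hash differs."""
    if existing_mesh is None:
        return True
    return existing_mesh.get("source_hash") != compute_source_hash_sketch(cube_spec)
```

Because fields outside `HASHED_FIELDS` never reach the hash, editing a bandwidth or `overhead_ns` value leaves the stored hash valid and the existing mesh is reused.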
### D3. Cube NoC mesh auto-layout
`_generate_mesh(cube_spec, source_hash)`:
#### D3.1. Rows / columns
- `pe_positions = _corner_pe_positions(cube_w, cube_h)`: PE-center
coordinates (mm) per corner (NW/NE/SW/SE). Hardcoded patterns like
`(1.5, 1.5)` and `(cube_w-1.5, cube_h-1.5)`; with `pe_per_corner=2`,
each corner has two PE positions.
- `col_xs = _compute_col_positions(...)`: union of PE x-coordinates,
plus relay columns inserted when any gap exceeds
`max_spacing = 3.0 mm`.
- `row_ys, rows_per_half = _compute_row_positions(cube_h,
n_connections, pe_positions)`:
- `n_conn = max(n_connections, 2)` (hot-path minimum).
- `rows_per_half = ceil(n_conn / 2)`.
- Top half + two HBM rows + bottom half. The two HBM rows sit at
y = `(cube_h/2 - 1.5, cube_h/2 + 1.5)`. The gap between PE rows and
HBM rows is `hbm_gap = 1.5 mm`.
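
The column and row-count rules above can be sketched as follows. The placement of relay columns is an assumption: this sketch subdivides an oversized gap into equal segments no wider than `max_spacing`, which the source does not confirm. The function names are hypothetical.

```python
import math

MAX_SPACING_MM = 3.0


def col_positions_sketch(pe_xs, max_spacing=MAX_SPACING_MM):
    """Union of PE x-coordinates plus relay columns for wide gaps."""
    xs = sorted(set(pe_xs))
    if not xs:
        return []
    cols = [xs[0]]
    for nxt in xs[1:]:
        prev = cols[-1]
        gap = nxt - prev
        if gap > max_spacing:
            # Assumption: evenly spaced relays; the real rule may differ.
            n_seg = math.ceil(gap / max_spacing)
            step = gap / n_seg
            cols.extend(prev + step * i for i in range(1, n_seg))
        cols.append(nxt)
    return cols


def rows_per_half_sketch(n_connections):
    """rows_per_half = ceil(max(n_connections, 2) / 2)."""
    n_conn = max(n_connections, 2)
    return math.ceil(n_conn / 2)


def total_rows_sketch(n_connections):
    """Top half + two HBM rows + bottom half."""
    return 2 * rows_per_half_sketch(n_connections) + 2
```

With `n_connections = 4`, this gives two rows per half and six rows in total, which matches the row indices used in D3.2 and D3.3.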
#### D3.2. HBM exclusion zone
`hbm_row_start = rows_per_half`,
`hbm_row_end = rows_per_half + 1`.
`hbm_col_start = n_cols // 2 - 1`,
`hbm_col_end = n_cols // 2`.
Router slots inside this (row, col) rectangle are marked `None` (no
router). HBM controllers are added separately as
`hbm_ctrl.pe{X}` nodes following ADR-0017 D9's per-PE partition
pattern.
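
A minimal sketch of the exclusion zone, assuming both the row and column bounds are inclusive and the grid has `2 * rows_per_half + 2` rows. `router_grid_sketch` is a hypothetical name, and the slot labels are illustrative.

```python
def router_grid_sketch(rows_per_half, n_cols):
    """Build a router grid with None in the HBM exclusion rectangle."""
    n_rows = 2 * rows_per_half + 2
    hbm_row_start = rows_per_half
    hbm_row_end = rows_per_half + 1
    hbm_col_start = n_cols // 2 - 1
    hbm_col_end = n_cols // 2
    grid = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            # Assumption: start/end bounds are inclusive on both axes.
            in_hbm = (hbm_row_start <= r <= hbm_row_end
                      and hbm_col_start <= c <= hbm_col_end)
            row.append(None if in_hbm else f"r{r}c{c}")
        grid.append(row)
    return grid
```

For `rows_per_half = 2` and `n_cols = 4`, rows 2 and 3 lose their two middle routers, leaving a 2 × 2 hole in the centre of a 6 × 4 grid.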
#### D3.3. PE attachment
Each corner's PEs map to a row:
- Top half: NW → row 0, NE → row 1 (top_corners index).
- Bottom half: SW → row `hbm_row_end + 1`, SE → row
`hbm_row_end + 2`.
Each PE attaches to the router in the column nearest its x-coordinate
(`min(range(n_cols), key=lambda c: abs(col_xs[c] - pe_x))`).
The items pushed into that router's attach list are `pe{pe_idx}.dma`,
`pe{pe_idx}.cpu`, and `pe{pe_idx}.hbm`.
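
The nearest-column expression can be exercised in isolation. In this sketch the `attach` dictionary keyed by `(row, col)` is a hypothetical stand-in for the router attach lists, and `attach_pe` is an illustrative helper name.

```python
def nearest_col(col_xs, pe_x):
    """Index of the column whose x-coordinate is closest to pe_x."""
    return min(range(len(col_xs)), key=lambda c: abs(col_xs[c] - pe_x))


def attach_pe(attach, row, col_xs, pe_idx, pe_x):
    """Push the three PE attachment items onto the chosen router."""
    col = nearest_col(col_xs, pe_x)
    attach.setdefault((row, col), []).extend(
        [f"pe{pe_idx}.dma", f"pe{pe_idx}.cpu", f"pe{pe_idx}.hbm"]
    )
    return col
```

One property worth knowing: Python's `min` returns the first minimal element, so a PE exactly midway between two columns attaches to the lower-indexed one.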
#### D3.4. M_CPU / SRAM attachment — nearest router by Euclidean distance
For `placement.m_cpu.pos_mm` (default `[1.5, 5.5]`) and
`placement.sram.pos_mm` (default `[1.5, 8.5]`), find the router with
the smallest Euclidean distance and append `"m_cpu"` / `"sram"` to
its attach list.
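
A sketch of the Euclidean nearest-router search. It assumes excluded (`None`) slots are skipped and that each router sits at `(col_xs[c], row_ys[r])` in mm; `nearest_router` is a hypothetical name.

```python
import math


def nearest_router(grid, col_xs, row_ys, pos_mm):
    """Return (row, col) of the router closest to pos_mm, or None."""
    px, py = pos_mm
    best = None
    best_dist = math.inf
    for r, row in enumerate(grid):
        for c, slot in enumerate(row):
            if slot is None:
                # Assumption: HBM exclusion slots hold no router to attach to.
                continue
            dist = math.hypot(col_xs[c] - px, row_ys[r] - py)
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best
```

Using mm coordinates on both sides keeps the placement input independent of how many relay columns or rows the layout step ends up inserting.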
#### D3.5. UCIe N/S/E/W distribution
`ucie_pe_rows = top_pe_rows + bot_pe_rows` (total
`2 * rows_per_half`).
- UCIe-E: one PE row at a time, attach `ucie_e.c{i}` to the rightmost
column's router.
- UCIe-W: attach `ucie_w.c{i}` to the leftmost column's router (E's
mirror).
- UCIe-N/S: split PE columns into left and right halves; attach to
the top row's / bottom row's matching columns.
Each UCIe connection is suffixed `c{i}`, distributing
ucie_n_connections PHYs (ADR-0017 D5+).
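
The E/W half of this distribution can be sketched as one connection per PE row, attached at the outermost columns. The N/S split is left out because the source does not give enough detail to reproduce it, and how the builder behaves when the PE rows outnumber `n_connections` is likewise not covered here. `attach_ucie_east_west` is a hypothetical name.

```python
def attach_ucie_east_west(ucie_pe_rows, n_cols):
    """Attach ucie_e.c{i} / ucie_w.c{i} per PE row at the grid edges."""
    attach = {}
    for i, row in enumerate(ucie_pe_rows):
        # East: rightmost column of this PE row.
        attach.setdefault((row, n_cols - 1), []).append(f"ucie_e.c{i}")
        # West: leftmost column, mirroring east.
        attach.setdefault((row, 0), []).append(f"ucie_w.c{i}")
    return attach
```

For `rows_per_half = 2`, the PE rows are `[0, 1, 4, 5]` (the HBM rows 2 and 3 are skipped), so connections `c0` through `c3` land on those four rows.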
### D4. Node naming convention — single ownership
builder.py creates nodes with the following naming convention (the
single-owner principle from ADR-0051 D5):
- `fabric.switch0` — system-level switch.
- `sip{S}.{io_id}.{pcie_ep|io_cpu|io_noc|io_ucie.{dir}|conn.{id}}` —
IO chiplet.
- `sip{S}.cube{C}.{m_cpu|sram|hbm_ctrl.pe{X}|noc.r{R}c{C}|...}` —
inside cube.
- `sip{S}.cube{C}.pe{P}.{pe_cpu|pe_dma|pe_fetch_store|pe_gemm|pe_math|pe_mmu|pe_tcm|pe_scheduler|pe_ipcq}` —
PE sub-components.
Changing this convention requires updating both builder.py and
router.py's helpers (ADR-0051). Components never know the convention
directly — they only call the helpers.
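
The naming patterns above can be captured by small formatting helpers. These are illustrative only: the function names are hypothetical and are not the helpers that `router.py` exposes.

```python
PE_SUBCOMPONENTS = (
    "pe_cpu", "pe_dma", "pe_fetch_store", "pe_gemm", "pe_math",
    "pe_mmu", "pe_tcm", "pe_scheduler", "pe_ipcq",
)


def cube_router_name(sip, cube, row, col):
    """sip{S}.cube{C}.noc.r{R}c{C} for a cube NoC router."""
    return f"sip{sip}.cube{cube}.noc.r{row}c{col}"


def hbm_ctrl_name(sip, cube, pe):
    """sip{S}.cube{C}.hbm_ctrl.pe{X} for a per-PE HBM controller."""
    return f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe}"


def pe_component_name(sip, cube, pe, component):
    """sip{S}.cube{C}.pe{P}.{component} for one of the nine PE parts."""
    if component not in PE_SUBCOMPONENTS:
        raise ValueError(f"unknown PE sub-component: {component}")
    return f"sip{sip}.cube{cube}.pe{pe}.{component}"
```

Keeping string formatting behind helpers like these is what lets components stay unaware of the convention: a rename touches the helpers, not every call site.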
### D5. Edge `kind` classification
Every edge gets a `kind`; routing policy (ADR-0051 D2) reads it. Major
kinds:
- `"pe_internal"` — within a PE between sub-components.
- `"pe_to_router"` — PE_DMA ↔ cube NoC router.
- `"router_mesh"` — between cube NoC routers.
- `"router_to_hbm"`, `"router_to_mcpu"`, `"router_to_sram"`,
`"sram_to_router"`, etc. — between cube-attached components.
- `"ucie_internal"`, `"ucie_conn_to_router"`,
`"router_to_ucie_conn"`, `"ucie_conn_to_noc"`,
`"noc_to_ucie_conn"`, `"ucie_mesh"` — UCIe-related.
- `"io_internal"` — inside IO chiplet.
- `"io_to_cube"`, `"cube_to_io"` — at the IO ↔ cube boundary.
- `"pcie"` — switch ↔ pcie_ep.
- `"command"` — control-plane edges only (e.g., M_CPU ↔ NOC; excluded
from PE DMA paths).
Adding a new edge kind requires picking a category in router.py's
four adjacency graphs (ADR-0051 D2). If you forget, it defaults to
`_adj_all` only, which can produce unintended routes.
### D6. View projection — four abstraction levels
`TopologyGraph` keeps four view projections alongside the flat
nodes+edges:
- **system_view** (`_build_system_view`): Tray level. SIP blocks and
`fabric.switch0`. PCIe links shown. For external high-level
overview.
- **sip_view** (`_build_sip_view`): inside one SIP — cube mesh + IO
chiplet (pcie_ep + io_cpu + io_noc). UCIe N/S/E/W appear as
cube-cube links.
- **cube_view** (`_build_cube_view`): inside one cube — router grid +
PE / M_CPU / SRAM / HBM_CTRL attachments + UCIe PHY edges. For
intra-cube routing / placement debugging.
- **pe_view** (`_build_pe_view`): inside one PE — nine sub-components
+ internal edges (pe_internal kind). For detailed PE-internal
dataflow review.
Views are selectively rendered via the spec's
`visualization.emit_views: [system, sip, cube]` (ADR-0006). The pe
view is omitted from default output but the code is retained for
detailed debugging.
### D7. visualizer.py — SVG diagram output
`emit_diagrams(graph, out_dir)` renders every view as SVG. Key
functions:
- `_render_view_svg(view)` — generic view render (no router grid).
- `_render_cube_view_svg(view, spec)` — cube-view specific (HBM block,
router grid layout, PE/M_CPU/SRAM/HBM placement).
- `_draw_node`, `_draw_edge` — node/edge visual representation.
- `_pick_scale`, `_compute_node_sizes` — auto-scaling.
The visualizer is a **derived artifact** (ADR-0006); changes here do
not go through production checks. This aligns with CLAUDE.md's
"Derived Artifacts" guidance.
### D8. Blast radius of spec changes
| spec field | effect | mesh regenerated? |
|---------------------------------------|---------------------|-------------------|
| `system.sips.count` | SIP count, node count | No |
| `sip.cube_mesh.w/h` | cube mesh shape | No |
| `cube.geometry.cube_mm.w/h` | cube size (mm) | **Yes** |
| `cube.pe_layout.corners/pe_per_corner`| PE attachment positions | **Yes** |
| `cube.ucie.n_connections` | UCIe PHY distribution | **Yes** |
| `cube.memory_map.hbm_mapping_mode` | HBM distribution mode | **Yes** |
| `cube.placement` | M_CPU/SRAM positions | **Yes** |
| `cube.memory_map.*` (besides above) | HBM capacity / BW | No |
| `*.links.*.bw_gbs` | edge bandwidth | No |
| `*.attrs.overhead_ns` | component latency | No |
The table mirrors D2's `_compute_source_hash` inputs. Changes that
require mesh regeneration automatically invalidate `cube_mesh.yaml`'s
source_hash.
## Alternatives Considered
### A1. Regenerate the mesh on every compile without a cache file
Rejected. The cost of mesh generation would be paid repeatedly (CLI
runs, probe, tests) for the same spec, and the human-inspectable
artifact would disappear.
### A2. Merge mesh generation into builder.py
Rejected (currently). It is a 305-line algorithm of its own, and the
mesh-layout decisions (placement-driven router attachment, HBM
exclusion zone) are different from builder's general node/edge
emission. Keeping it separate respects single-responsibility.
### A3. Express placement coordinates in cube coordinates (col/row)
Rejected. mm coordinates flow consistently between the visualizer and
mesh layout (for nearest-router computation). Cube coordinates are
undefined until the router grid is fixed, so they are unsuitable as
placement input.
### A4. Lazy view projection generation
Rejected (currently). The four views are cheap to build (typically <
100 ms), and eager construction guarantees `TopologyGraph` as the
single source of truth.
### A5. Visualizer output in formats besides SVG (PNG/PDF)
Rejected. SVG is vector + text-searchable + directly renderable in
browsers. PNG conversion, when required, is downstream
post-processing (e.g., rsvg-convert).
## Consequences
- ADR-0006's high-level intent is fleshed out via D1–D7; topology
changes can be assessed quickly via D8's table.
- D3's mesh-layout algorithm is ADR-locked, so when a future PE
attachment pattern is proposed (e.g., a 6-zone HBM split), it is
clear which stage it affects.
- D5's edge-kind list and D6's view structure are explicit, giving PR
reviewers a quick map of where (builder + router + visualizer) a
new component type ripples through.
- D2's source_hash invalidation rules are explicit, so a
`cube_mesh.yaml` that stays unchanged after a spec edit (e.g., when
only bandwidth changed) is recognized as correct behavior rather than
a stale cache.