adr: add ADR-0050-0053 — close /report's second-pass G4 candidates

Documents four cross-cutting surfaces one layer deeper than the prior
G4 batch:

- 0050 par-ccl-algorithm-module-contract: how to author a new CCL
  algorithm in src/kernbench/ccl/algorithms/. Pairs with ADR-0045's
  bench-module contract. Pins the four required public symbols
  (kernel, kernel_args, TOPO_NAME_TO_KIND constants, kernel alias),
  the 9 + tl standardized kernel signature, the kernel_args tuple
  format, sip_topo_kind dispatch, and the ccl.yaml entry workflow.

- 0051 lat-routing-helper-api: every public method of AddressResolver
  (resolve, find_m_cpu, find_pcie_ep, find_io_cpu, find_all_pcie_eps)
  and PathRouter (find_path, find_path_with_distance,
  find_mcpu_dma_path, find_memory_path, find_node_path + 2 shims).
  Pins the four adjacency graphs (_adj_all / _adj / _adj_mcpu_dma /
  _adj_local) and the edge-kind exclusion sets they use, plus the
  single-owner naming convention.

- 0052 dev-oplog-memory-store-schemas: OpRecord's 7 fields, the
  per-op_name params matrix (dma_read, dma_write, gemm_*, math, math
  reduction, composite_gemm, ipcq_copy, unknown), snapshot timing
  rules (math = all inputs, dma_write = HBM-only — ADR-0027 race
  avoidance), TileToken stage_type capture, and MemoryStore's
  (space, addr) two-level dict with reference-store semantics.

- 0053 dev-topology-builder-algorithms: the 6-stage compile pipeline,
  cube_mesh.yaml's source_hash cache and its 5 input fields, the
  cube NoC auto-layout algorithm (row/col placement, HBM exclusion
  zone, PE/M_CPU/SRAM attachment via nearest-router, UCIe N/S/E/W
  distribution), the node naming convention (single-owner with
  router.py), the edge-kind catalog, the 4 view projections, and a
  table of spec-field changes vs mesh regeneration.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:52:42 -07:00
parent 9a02955770
commit bd49c93703
8 changed files with 2566 additions and 0 deletions
@@ -0,0 +1,322 @@
# ADR-0050: CCL Algorithm Module Contract — `ccl/algorithms/*.py`
## Status
Accepted (2026-05-22).
Pins down the interface, kernel signature, and addition workflow that a
module under `src/kernbench/ccl/algorithms/` must satisfy in order to be
used as a collective algorithm by the AHBM CCL backend (ADR-0047).
ADR-0047 D3 states only that "the algorithm module must expose `kernel`,
`kernel_args`, optionally `TOPO_NAME_TO_KIND`"; **the contract an
algorithm-module author needs to follow** has had no ADR-level coverage.
This ADR pairs with ADR-0045's bench-module contract.
## First action
An algorithm module is imported at two moments:
1. **AHBM backend entry**: when user code calls
`dist.init_process_group(backend="ahbm")`,
`AhbmCCLBackend.__init__` runs
`self._algo_module = importlib.import_module(self._merged["module"])`.
At module level, the following occur first:
- Topology-kind integer constants like `SIP_TOPO_RING/TORUS/MESH`
are bound in the module namespace.
- The `TOPO_NAME_TO_KIND` dict is bound; the backend reads it via
`getattr(self._algo_module, "TOPO_NAME_TO_KIND", None)`.
- The `kernel_args` function is defined for the caller.
- The actual algorithm function (e.g.,
`allreduce_intercube_multidevice`) is defined.
- At the bottom of the module, `kernel = allreduce_intercube_multidevice`
publishes the alias.
2. **ccl.yaml install stage**:
`kernbench.ccl.install.install_ipcq` imports the same algorithm
module while pushing the IPCQ neighbor table.
In short, **the algorithm module's first act is "publish topology-kind
constants, the `TOPO_NAME_TO_KIND` dict, the `kernel_args` function, and
the `kernel` alias into the module namespace"** — all as import-time
side effects, no separate initialization call.
## Context
`AhbmCCLBackend` (ADR-0047), at process-group creation, dynamically
imports a module path obtained from `ccl.yaml`'s `defaults.algorithm` (or
a user-specified algorithm). The backend expects four things from the
module:
- `kernel`: the collective's entry function.
- `kernel_args(world_size, n_elem, cube_w=, cube_h=) -> tuple`: a tuple
packing the kernel's positional arguments.
- `TOPO_NAME_TO_KIND` (optional): a dict mapping `topology.yaml`'s
`sips.topology` string (e.g., `"ring_1d"`, `"torus_2d"`,
`"mesh_2d_no_wrap"`) to the integer kind constants.
- (Indirectly) IPCQ neighbor-table install:
`configure_sfr_intercube_multisip` reads
the module's `TOPO_NAME_TO_KIND` plus cube dimensions to decide the
SFR.
The current corpus has one algorithm module:
`lrab_hierarchical_allreduce.py` (248 lines). The name expands to
"**l**eft-**r**ight **a**lternating **b**roadcast hierarchical allreduce".
When future modules like `ring_allreduce`, `tree_allreduce`, or
`broadcast` are added, they must follow this contract for the backend's
dispatch path to keep working.
Without an ADR-level contract:
- A new algorithm author has to infer the signature from ADR-0047 D3's
one-liner.
- The kernel-function argument order (especially `t_ptr, n_elem,
cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w,
sip_topo_h, tl`) is unclear without grep.
- What `kernel_args` takes as inputs and what tuple it must return is
convention only; it is not documented anywhere.
## Decision
### D1. The algorithm module exposes four public symbols
```python
# src/kernbench/ccl/algorithms/<name>.py
from __future__ import annotations
# (required) topology-kind constants — referenced internally
SIP_TOPO_RING = 0
SIP_TOPO_TORUS = 1
SIP_TOPO_MESH = 2
# (optional) topology name → kind mapping. Used by the backend to
# translate ccl.yaml/topology's string SIP topology into an integer.
TOPO_NAME_TO_KIND = {
"ring_1d": SIP_TOPO_RING,
"torus_2d": SIP_TOPO_TORUS,
"mesh_2d_no_wrap": SIP_TOPO_MESH,
}
# (required) kernel argument builder
def kernel_args(world_size: int, n_elem: int, *, cube_w: int = 4, cube_h: int = 4) -> tuple:
return (n_elem, cube_w, cube_h, world_size)
# (required) kernel function (TLContext is injected via the `tl=...`
# keyword argument).
def my_allreduce_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
...
# (required) kernel alias — the backend accesses `module.kernel`
kernel = my_allreduce_kernel
```
- The `kernel` alias is the entry point the backend invokes. Whatever
the function name is (e.g., `allreduce_intercube_multidevice`), it
must be exposed via `module.kernel = fn`.
- Without `kernel_args`, the backend has no way to build the
algorithm's argument list. See D2 for the signature.
- If `TOPO_NAME_TO_KIND` is absent, the backend falls back to
`sip_topo_kind = 0`. An algorithm supporting only a single topology
may omit it.
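The D1 contract can be checked mechanically. The sketch below is only an illustration: `check_algorithm_module` is a hypothetical helper, not something the backend is known to provide, and it assumes nothing beyond the symbol names listed above.

```python
import types


def check_algorithm_module(module) -> dict:
    """Return the D1 symbols of `module`, raising if a required one is missing.

    Hypothetical helper for illustration; the real backend may validate
    differently, or not at all.
    """
    missing = [name for name in ("kernel", "kernel_args")
               if not callable(getattr(module, name, None))]
    if missing:
        raise AttributeError(
            f"{module.__name__} is missing required symbols: {missing}")
    return {
        "kernel": module.kernel,
        "kernel_args": module.kernel_args,
        # Optional: when absent, the backend falls back to sip_topo_kind = 0.
        "topo_name_to_kind": getattr(module, "TOPO_NAME_TO_KIND", None),
    }


# Build a throwaway module object that follows D1, then validate it.
demo = types.ModuleType("demo_algo")
demo.kernel_args = lambda world_size, n_elem, *, cube_w=4, cube_h=4: (
    n_elem, cube_w, cube_h, world_size)
demo.kernel = lambda *args, tl=None: None
symbols = check_algorithm_module(demo)
```

A check like this would fit naturally in a unit test for each new algorithm module, so a missing `kernel` alias is caught before the backend imports it.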
### D2. `kernel_args` signature — `(world_size, n_elem, *, cube_w, cube_h)`
```python
def kernel_args(world_size: int, n_elem: int, *,
cube_w: int = 4, cube_h: int = 4) -> tuple:
return (n_elem, cube_w, cube_h, world_size)
```
- **Positional arguments**: `world_size` (= number of ranks), `n_elem`
(= element count of a single shard, f16-based).
- **Keyword arguments**: `cube_w`, `cube_h` (= cube-mesh dimensions).
Default 4×4 — aligned with `topology.yaml`'s `sip.cube_mesh` default.
- **Return**: a tuple in the order the kernel's positional arguments
expect.
When the backend calls `all_reduce`:
```python
kernel_args_tuple = self._algo_module.kernel_args(
self._world_size, n_elem, cube_w=eff_cube_w, cube_h=eff_cube_h,
)
extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)
pending = self.ctx.launch(
self._merged["algorithm"], kernel_fn, tensor,
*kernel_args_tuple, *extra_args, _defer_wait=True,
)
```
So the kernel's full positional argument list becomes: `(tensor_ptr,
*kernel_args_tuple, sip_rank, sip_topo_kind, sip_topo_w,
sip_topo_h)`, with `tl=...` injected as a keyword. The tuple length
and order returned by `kernel_args` must **match the kernel signature
1:1**.
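The 1:1 requirement can be verified with `inspect.signature`. This is a minimal sketch under the assumption that the backend always prepends one tensor pointer and appends exactly four `extra_args`, as shown in the call above; `positional_arity_matches` is a hypothetical name.

```python
import inspect

N_EXTRA_ARGS = 4  # sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h


def positional_arity_matches(kernel, kernel_args_tuple) -> bool:
    """True if tensor_ptr + kernel_args + extra_args fill the kernel's
    positional parameters exactly (keyword-only `tl` is not counted)."""
    params = inspect.signature(kernel).parameters.values()
    n_positional = sum(
        p.kind in (p.POSITIONAL_ONLY, p.POSITIONAL_OR_KEYWORD) for p in params)
    return n_positional == 1 + len(kernel_args_tuple) + N_EXTRA_ARGS


def kernel_args(world_size, n_elem, *, cube_w=4, cube_h=4):
    return (n_elem, cube_w, cube_h, world_size)


def my_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
              sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
    return None


ok = positional_arity_matches(my_kernel, kernel_args(2, 8))
```

The check only counts parameters. It cannot tell whether the tuple's order matches the kernel's intent, so the ordering rule in D2 still has to be followed by hand.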
### D3. Kernel signature — standardized 9 + tl arguments
Recommended signature:
```python
def my_kernel(
t_ptr: int, # VA base of the row-wise-sharded tensor on this SIP
n_elem: int, # element count per cube tile (or per shard)
cube_w: int, # cube mesh width (from kernel_args)
cube_h: int, # cube mesh height (from kernel_args)
n_sips: int, # equal to world_size (rank = SIP, ADR-0024)
sip_rank: int, # this SIP's rank
sip_topo_kind: int, # result of TOPO_NAME_TO_KIND lookup
sip_topo_w: int, # SIP mesh width (0 for ring_1d)
sip_topo_h: int, # SIP mesh height (0 for ring_1d)
*, tl, # TLContext (auto-injected)
) -> None:
```
Even if `kernel_args` chose a different positional argument order, the
kernel's **last four positional arguments are always
`(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`** — the backend
appends them as `extra_args` (ADR-0047 D5). A custom algorithm must
accept these four, but a single-SIP algorithm may simply ignore them.
`tl` is injected via keyword — `RuntimeContext.launch` adds `tl=tl_ctx`
just before invoking the kernel. The signature therefore exposes `tl`
as keyword-only (`*, tl`) or as the trailing keyword parameter.
### D4. Kernel body — freedom and constraints
Available inside the kernel: every `tl.*` primitive from ADR-0046 D3.
Common patterns:
- `cube_id = tl.program_id(axis=1)` — this PE's cube index.
- `pe_addr = t_ptr + cube_id * nbytes` — per-cube VA of the tile.
- `acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")` — load local
data.
- `tl.send(dir=...)` / `tl.recv(dir=..., shape=, dtype=)` — IPCQ
collective.
- `acc = acc + recv` — TensorHandle arithmetic operators (ADR-0046 D4).
- `tl.store(pe_addr, acc)` — store the result.
The kernel body is plain Python — branching and loops are fine. But:
- No SimPy `yield` or `async` (ADR-0046 D1).
- No direct access to TensorHandle `.data` — the Phase 1 timing model
doesn't see data dependencies (ADR-0020's 2-pass separation).
- Kernel execution must be deterministic — the same input must produce
the same op sequence. No random or external IO.
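One way to test the determinism constraint is to run a kernel twice against a recording stub and compare the emitted op sequences. In the sketch below, `RecordingTL` is a hypothetical stand-in that mimics only the call shapes listed above; it is not the real TLContext, and `toy_kernel` uses a shortened signature for brevity.

```python
class RecordingTL:
    """Hypothetical stand-in for TLContext that only logs op names and
    arguments; it is not the real `tl` API surface."""

    def __init__(self, cube_id=0):
        self.ops = []
        self._cube_id = cube_id

    def program_id(self, axis):
        self.ops.append(("program_id", axis))
        return self._cube_id

    def load(self, addr, shape, dtype):
        self.ops.append(("load", addr, shape, dtype))
        return ("handle", addr)  # opaque token, no data access

    def store(self, addr, value):
        self.ops.append(("store", addr, value))


def toy_kernel(t_ptr, n_elem, *, tl):
    cube_id = tl.program_id(axis=1)
    pe_addr = t_ptr + cube_id * n_elem * 2  # 2 bytes per f16 element
    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    tl.store(pe_addr, acc)


def op_sequence(kernel, *args, cube_id=0):
    """Run `kernel` once against a fresh recorder and return its op log."""
    tl = RecordingTL(cube_id)
    kernel(*args, tl=tl)
    return tl.ops


first = op_sequence(toy_kernel, 0x1000, 8, cube_id=1)
second = op_sequence(toy_kernel, 0x1000, 8, cube_id=1)
```

If `first != second` for identical inputs, the kernel depends on something outside its arguments (randomness, wall-clock time, external IO) and violates the constraint.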
### D5. SIP topology semantics — meaning of `sip_topo_kind`
The backend looks up `topology.yaml`'s `system.sips.topology` string
in the algorithm module's `TOPO_NAME_TO_KIND` and passes the integer
as `sip_topo_kind`. The algorithm then branches:
```python
if sip_topo_kind == SIP_TOPO_RING:
acc = _inter_sip_ring(...)
elif sip_topo_kind == SIP_TOPO_TORUS:
acc = _inter_sip_torus_2d(...)
elif sip_topo_kind == SIP_TOPO_MESH:
acc = _inter_sip_mesh_2d(...)
```
Each topology branch communicates with peers via IPCQ direction names
(`"global_E"`, `"W"`, `"S"`, `"N"` …). Direction semantics are defined
in ADR-0023/0025; `configure_sfr_intercube_multisip` installs the IPCQ
neighbor table accordingly.
If a topology kind not supported by the algorithm appears, prefer an
explicit `raise ValueError(f"unsupported topology kind
{sip_topo_kind}")` over a silent no-op — fail fast on misconfiguration.
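A small sketch of the fail-fast dispatch, reduced to selecting a phase name so that it runs on its own. `select_inter_sip_phase` and the phase labels are hypothetical; a real kernel would call its `_inter_sip_*` helpers instead of returning a string.

```python
SIP_TOPO_RING, SIP_TOPO_TORUS, SIP_TOPO_MESH = 0, 1, 2


def select_inter_sip_phase(sip_topo_kind, supported=None):
    """Return the phase label for `sip_topo_kind`, or raise on an unknown kind.

    `supported` lets a single-topology algorithm narrow the accepted set.
    """
    phases = {
        SIP_TOPO_RING: "ring",
        SIP_TOPO_TORUS: "torus_2d",
        SIP_TOPO_MESH: "mesh_2d",
    }
    if supported is not None:
        phases = {k: v for k, v in phases.items() if k in supported}
    if sip_topo_kind not in phases:
        # Fail fast instead of silently skipping the inter-SIP phase.
        raise ValueError(f"unsupported topology kind {sip_topo_kind}")
    return phases[sip_topo_kind]


ring_phase = select_inter_sip_phase(SIP_TOPO_RING)
```

A ring-only algorithm could pass `supported={SIP_TOPO_RING}` so that a torus or mesh configuration is rejected rather than silently ignored.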
### D6. The `ccl.yaml` algorithm entry
The algorithm module is paired with a `ccl.yaml` entry (ADR-0023 D10 +
ADR-0047 D3):
```yaml
defaults:
algorithm: lrab_hierarchical_allreduce
n_elem: 8
algorithms:
lrab_hierarchical_allreduce:
module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
# optional: world_size override
# optional: per-algorithm parameters consumed by configure_sfr_intercube_multisip
```
- `module`: the full Python module path; `importlib.import_module`
consumes this string as-is.
- `world_size` (optional): when set, overrides the topology fallback
(ADR-0047 D2).
- Algorithm-specific parameters are consumed by
`configure_sfr_intercube_multisip`.
Workflow to add a new algorithm:
1. Write `src/kernbench/ccl/algorithms/<name>.py` following D1.
2. Add the entry under `algorithms` in `ccl.yaml`.
3. (If needed) extend `kernbench.ccl.sfr_config` with the SFR-install
branch.
4. Add tests (e.g., `tests/sccl/test_<name>.py`, extending the
ADR-0043 eval harness).
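To make step 2 concrete, the sketch below shows one plausible way a `ccl.yaml` entry could be resolved into the merged dict whose `"module"` and `"algorithm"` keys the backend reads. The merge rule (entry values override `defaults`) and the name `merge_ccl_entry` are assumptions for illustration; the config is written as a Python dict to avoid a YAML dependency.

```python
def merge_ccl_entry(cfg, algorithm=None):
    """Merge `defaults` with one `algorithms` entry (entry wins on clashes).

    Assumed merge rule for illustration; the backend's actual rule is not
    specified in this ADR.
    """
    defaults = dict(cfg.get("defaults") or {})
    name = algorithm or defaults.get("algorithm")
    entries = cfg.get("algorithms") or {}
    if name not in entries:
        raise KeyError(f"algorithm {name!r} has no entry under 'algorithms'")
    merged = {**defaults, **(entries[name] or {}), "algorithm": name}
    if "module" not in merged:
        raise KeyError(f"algorithm {name!r} is missing the 'module' key")
    return merged


# Dict equivalent of the YAML shown above.
cfg = {
    "defaults": {"algorithm": "lrab_hierarchical_allreduce", "n_elem": 8},
    "algorithms": {
        "lrab_hierarchical_allreduce": {
            "module": "kernbench.ccl.algorithms.lrab_hierarchical_allreduce",
        },
    },
}
merged = merge_ccl_entry(cfg)
```

The resulting `merged["module"]` string is what `importlib.import_module` would consume as-is.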
### D7. Legacy "rank = flat PE index" mode
The `world_size` override in `ccl.yaml`, surfaced by ADR-0047 D2, is
used by legacy "rank = flat PE index" tests. The algorithm module can
assume `n_sips=world_size` ranks even in this mode — the backend
maintains the rank↔(SIP, cube, PE) mapping, so no modal branching is
needed inside the algorithm body.
In single-cube workloads (where `cube_w=cube_h=1`), the algorithm must
skip mesh-based phases — see the
`single_cube = (cube_w == 1 and cube_h == 1)` pattern in
`lrab_hierarchical_allreduce.py`.
## Alternatives Considered
### A1. Organize the algorithm module as a class (`class Allreduce: kernel(...)`)
Rejected. The Python module namespace already identifies an algorithm
(see ADR-0047 D3's `importlib.import_module`). A class wrapper adds
indirection without simplifying dispatch. Module-level free functions
plus a `kernel` alias are clean and obvious.
### A2. Type `kernel_args` with an explicit dataclass
Rejected (currently). Each algorithm normally has a different argument
count; forcing one dataclass would hurt cross-algorithm interchange.
The tuple return is simple and unpacks cleanly with the backend's
`*kernel_args_tuple`. If an algorithm wants stronger internal typing,
it may define its own NamedTuple.
### A3. Move SFR installation inside the algorithm module
Rejected. SFR installation
(`configure_sfr_intercube_multisip`) is a cross-module decision
combining topology + algorithm; `kernbench.ccl.sfr_config` is a more
natural home than the algorithm module itself. D6's "extend
sfr_config if needed" workflow keeps responsibility boundaries clear.
### A4. Auto-register algorithm names via a decorator (analogous to ADR-0045's `@bench`)
Rejected. Unlike benches, algorithms are already tied to `ccl.yaml`
entries; an additional registry would be redundant. The string mapping
in `module` is sufficient.
## Consequences
- ADR-0047 D3's one-line contract expands to a D1-D7 author-facing
guide; new algorithm signatures no longer need to be grep-derived.
- D3's standardized 9 + tl signature couples naturally with the
backend's `extra_args` append (ADR-0047 D5). It is explicit that
even single-SIP-only algorithms must accept the four `sip_*` trailing
arguments.
- D5's fail-loud recommendation means a `ccl.yaml` topology that the
algorithm doesn't support will surface as an explicit `ValueError`
rather than a silent wrong result.
- D6's step-by-step addition workflow makes clear how far a new
algorithm has to reach into sfr_config / tests / ccl.yaml.
@@ -0,0 +1,288 @@
# ADR-0051: Routing Helper API — `AddressResolver` + `PathRouter`
## Status
Accepted (2026-05-22).
Pins down every public API, argument, return value, and adjacency-graph
selection of the two helper classes (`AddressResolver`, `PathRouter`)
exposed by `policy/routing/router.py`. ADR-0002 defines routing
distance, ordering, and bypass rules, but **the helper API surface
itself** has had no ADR-level coverage.
## First action
### `AddressResolver(graph)`
On construction, caches two pieces of state:
1. `self._node_ids = set(graph.nodes)` — a set of all node ids for
lookup.
2. `self._hbm_slice_bytes = hbm_total_gb * (1 << 30) // slices_per_cube`
— derived from `graph.spec.cube.memory_map` (default `48 GB / 8
slices = 6 GB`). `resolve()` uses this value to decode `pe_id` from
an HBM PA's `hbm_offset`.
In short, **AddressResolver's first act is "precompute the full set of
node ids and the HBM slice size"**. It does not retain the graph
itself.
### `PathRouter(graph)`
On construction, **builds four separate adjacency graphs in one pass**:
1. `self._adj_all`: every edge (used for component-to-component
routing).
2. `self._adj`: edges with `kind != "command"` (PE DMA / generic data
paths).
3. `self._adj_mcpu_dma`: excludes
`_MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"}` (M_CPU DMA
must not pass through PE pipeline nodes).
4. `self._adj_local`: excludes the 8-element `_UCIE_KINDS` set (UCIe
would look like a zero-distance bus to Dijkstra, which would prefer
it over the mesh — for cube-local routing this must be avoided).
Each graph is a `defaultdict(list)` of `(neighbor, weight)`. The
weight is `edge.routing_weight_mm or edge.distance_mm`.
In short, **PathRouter's first act is "classify topology edges into
four policy-specific adjacency lists simultaneously"**. Each `find_*()`
call picks the appropriate graph and runs Dijkstra.
## Context
`policy/routing/router.py` carries two responsibilities:
- **Naming**: it is the sole owner of the topology naming convention
(`sip{S}.cube{C}.<comp>`, `sip{S}.io{I}.pcie_ep`, etc.). Components /
probe / IPCQ install / runtime API do not build node-id strings
themselves — they call helpers.
- **Path decisions**: policy separation by `edge.kind`. For the same
src→dst, different routing intents (PE DMA vs M_CPU DMA vs general
component routing) call for different adjacencies and so produce
different paths.
This helper API is widely consumed (probe.py / distributed.py /
install.py / various components / tests), yet **the exact signatures /
return semantics / adjacency picks** are not gathered in any ADR. This
ADR closes that gap.
## Decision
### D1. `AddressResolver` exposes five public methods
#### D1.1. `resolve(addr: PhysAddr) -> str`
Translates a `PhysAddr` to a destination node id in the topology:
```
addr.kind == "hbm" → f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
where pe_id = addr.hbm_offset // self._hbm_slice_bytes (ADR-0017 D4/D9)
addr.kind == "pe_resource":
addr.unit_type == PE → f"sip{s}.cube{d}.pe{addr.pe_id}.pe_tcm"
addr.unit_type == SRAM → f"sip{s}.cube{d}.sram"
addr.unit_type == MCPU → f"sip{s}.cube{d}.m_cpu"
others → RoutingError("unsupported unit_type")
other kinds → RoutingError("unsupported address kind")
```
If the derived node id is not in `self._node_ids`, raises
`RoutingError(f"node {node_id} not found in topology")`. So even a
syntactically valid address fails loudly when its node is absent from
the topology.
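The HBM branch of `resolve` can be sketched as follows. `HbmAddr` is a hypothetical, reduced stand-in for `PhysAddr` (its field names `sip` and `cube` are assumptions), and `RoutingError` is redefined locally so the example runs on its own.

```python
from dataclasses import dataclass


class RoutingError(Exception):
    """Local stand-in for the router's error type."""


@dataclass(frozen=True)
class HbmAddr:
    """Hypothetical, reduced stand-in for an HBM-kind PhysAddr."""
    sip: int
    cube: int
    hbm_offset: int


def hbm_slice_bytes(hbm_total_gb=48, slices_per_cube=8):
    # Same formula as the constructor: total bytes divided by slice count.
    return hbm_total_gb * (1 << 30) // slices_per_cube


def resolve_hbm(addr, node_ids, slice_bytes):
    """Decode pe_id from the HBM offset and return the hbm_ctrl node id."""
    pe_id = addr.hbm_offset // slice_bytes
    node_id = f"sip{addr.sip}.cube{addr.cube}.hbm_ctrl.pe{pe_id}"
    if node_id not in node_ids:
        raise RoutingError(f"node {node_id} not found in topology")
    return node_id


slice_bytes = hbm_slice_bytes()
nodes = {f"sip0.cube0.hbm_ctrl.pe{i}" for i in range(8)}
# An offset of 13 GiB falls into the third 6 GiB slice, i.e. pe2.
target = resolve_hbm(HbmAddr(0, 0, 13 * (1 << 30)), nodes, slice_bytes)
```

An offset at or beyond 48 GiB decodes to `pe8`, which is not in the node set, so the membership check is what turns an out-of-range address into a `RoutingError`.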
#### D1.2. `find_m_cpu(sip, cube) -> str`
Returns `f"sip{sip}.cube{cube}.m_cpu"`; absent → `RoutingError`.
#### D1.3. `find_pcie_ep(sip, io_id="io0") -> str`
Returns `f"sip{sip}.{io_id}.pcie_ep"`; absent → `RoutingError`.
#### D1.4. `find_io_cpu(sip, io_id="io0") -> str`
Returns `f"sip{sip}.{io_id}.io_cpu"`; absent → `RoutingError`.
#### D1.5. `find_all_pcie_eps() -> list[str]`
All PCIE_EP node ids across all SIPs, sorted. Filtered by
`endswith(".pcie_ep")`. Cross-SIP IPCQ uses this when enumerating
PCIE_EPs.
This class is the sole owner of the naming convention
(`sip{S}.cube{C}.<comp>`, `sip{S}.{io_id}.<comp>`) — ADR-0015 D4.
The topology builder produces nodes with the same naming convention;
components never build node-id strings directly — they go through
these helpers.
### D2. `PathRouter`'s four adjacency graphs
Constructed in one pass. `edge.kind` drives policy:
| graph | excluded edge kinds | use case |
|-------------------|--------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|
| `_adj_all` | (none) | M_CPU↔NOC command included, IO_CPU/M_CPU routes |
| `_adj` | `"command"` | PE DMA / generic data paths |
| `_adj_mcpu_dma` | `"pe_internal"`, `"pe_to_router"` | M_CPU DMA (skips PE pipeline) |
| `_adj_local` | `_UCIE_KINDS` (`ucie_internal`, `ucie_conn_to_router`, `router_to_ucie_conn`, `ucie_conn_to_noc`, `noc_to_ucie_conn`, `ucie_mesh`, `io_to_cube`, `cube_to_io`) | same-cube routing (UCIe bus excluded) |
Each graph is `dict[node_id, list[(neighbor, weight)]]` with weight =
`edge.routing_weight_mm or edge.distance_mm`. Excluding command edges
prevents them from influencing routing; isolating `_adj_local` keeps
UCIe's "zero-distance bus" from out-competing the mesh — consistent
with ADR-0017 D7's cross-PE-slice mesh-distance requirement.
### D3. `PathRouter` exposes five public methods (+ two backward-compat shims)
#### D3.1. `find_path(src_pe: str, dst_node: str) -> list[str]`
**PE DMA routing**. `src_pe` is a PE prefix (e.g.,
`"sip0.cube0.pe0"`); the function auto-prepends `.pe_dma`, making the
true start node `"sip0.cube0.pe0.pe_dma"`.
Adjacency depends on cube-locality (`_same_cube`):
- **Same-cube** (src and dst share `sip{S}.cube{C}.` prefix): uses
`_adj_local`. Excluding UCIe lets cross-PE-slice access pay accurate
mesh distance (ADR-0017 D7).
- **Cross-cube**: uses `_adj`. UCIe naturally becomes the right choice
for the cross-cube portion.
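The cube-locality test reduces to a prefix comparison. The sketch below is an assumption about how `_same_cube` might work, based only on the `sip{S}.cube{C}.` naming convention; the function names are hypothetical and it returns the adjacency's name rather than the graph itself.

```python
import re

# Matches the "sip{S}.cube{C}" prefix of a cube-local node id.
_CUBE_PREFIX = re.compile(r"^(sip\d+\.cube\d+)\.")


def cube_prefix(node_id):
    """Return "sip{S}.cube{C}" for cube-local nodes, else None."""
    m = _CUBE_PREFIX.match(node_id)
    return m.group(1) if m else None


def same_cube(src, dst):
    a, b = cube_prefix(src), cube_prefix(dst)
    return a is not None and a == b


def pick_adjacency_name(src_pe, dst_node):
    """Mirror D3.1: prepend .pe_dma, then choose by cube locality."""
    start = f"{src_pe}.pe_dma"
    return "_adj_local" if same_cube(start, dst_node) else "_adj"


local = pick_adjacency_name("sip0.cube0.pe0", "sip0.cube0.hbm_ctrl.pe3")
remote = pick_adjacency_name("sip0.cube0.pe0", "sip0.cube1.hbm_ctrl.pe3")
```

Matching the full `sip\d+\.cube\d+` group, rather than using a plain `startswith`, keeps `cube1` from being mistaken for `cube10`.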
#### D3.2. `find_path_with_distance(src_pe, dst_node) -> tuple[list[str], float]`
Same adjacency policy as D3.1, but returns `(path, total_distance)`.
Used by probe and analysis tools that need the distance metric.
#### D3.3. `find_mcpu_dma_path(m_cpu_id: str, dst_hbm_id: str) -> list[str]`
**M_CPU DMA path**. Same cube → `_adj_local` (stay within the mesh);
different cube → `_adj_all` (cross via UCIe). The
`_MCPU_DMA_EXCLUDE` set ensures PE-pipeline nodes never appear on
M_CPU's routes.
#### D3.4. `find_memory_path(src: str, dst: str) -> list[str]`
Direct memory path like
`pcie_ep → io_noc → cube → router mesh → hbm_ctrl`. Uses
`_adj_mcpu_dma` to exclude `pe_internal` and `pe_to_router`, so
host-issued reads/writes never leak into the PE pipeline. Probe
(ADR-0049 D1's H2D/D2H cases) calls this directly.
#### D3.5. `find_node_path(src: str, dst: str) -> list[str]`
Generic routing between arbitrary nodes, **including command edges**
(via `_adj_all`). IoCpuComponent / MCpuComponent use this when they
need to route through M_CPU ↔ NOC command-kind links.
#### D3.6. Backward-compat shims
- `_dijkstra(start, goal) -> list[str]` — thin wrapper for
`_run_dijkstra(self._adj, …)`.
- `_dijkstra_with_dist(start, goal) -> tuple[list[str], float]`:
distance-aware variant.
Despite the underscore prefixes (suggesting internal API), existing
tests call these directly. New code should prefer D3.1-D3.5; these two
shims are deprecation candidates.
### D4. Dijkstra — single-source shortest path
`_run_dijkstra_with_dist(adj, start, goal)`:
- `heapq` priority queue.
- `best: dict[node, distance]` — best known distance to each node.
- `prev: dict[node, predecessor]` — for path reconstruction.
- Edge weight = `routing_weight_mm or distance_mm`. The separation
matters because UCIe (and a few others) declare an explicit
`routing_weight_mm` distinct from physical `distance_mm`.
`start == goal` short-circuits to `([start], 0.0)`. An unreachable target
raises `RoutingError(f"no path from {start} to {goal}")`.
The algorithm is **deterministic**: identical graph + start/goal gives
the same path, satisfying SPEC R1 ("routing MUST be deterministic").
Tie-breaks follow `heapq`'s push order (Python list order is
deterministic).
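A minimal sketch of the search described above, using the same `best` / `prev` bookkeeping. It is not the code in router.py: in particular, this version pushes `(distance, node_id)` tuples, so equal distances are ordered by node id, which is an assumption of the sketch and may differ from the real tie-break.

```python
import heapq


class RoutingError(Exception):
    """Local stand-in for the router's error type."""


def run_dijkstra_with_dist(adj, start, goal):
    """Return (path, total_distance) over adj: node -> [(neighbor, weight)]."""
    if start == goal:
        return [start], 0.0
    best = {start: 0.0}   # best known distance per node
    prev = {}             # predecessor map for path reconstruction
    heap = [(0.0, start)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == goal:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry, a shorter route was already found
        for neighbor, weight in adj.get(node, ()):
            cand = dist + weight
            if cand < best.get(neighbor, float("inf")):
                best[neighbor] = cand
                prev[neighbor] = node
                heapq.heappush(heap, (cand, neighbor))
    if goal not in best:
        raise RoutingError(f"no path from {start} to {goal}")
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    path.reverse()
    return path, best[goal]


adj = {"a": [("b", 1.0), ("c", 5.0)], "b": [("c", 1.0)]}
path, dist = run_dijkstra_with_dist(adj, "a", "c")
```

Here the two-hop route through `b` (total 2.0) beats the direct edge of weight 5.0, which is the behavior that hop-count BFS would get wrong.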
### D5. Single-owner principle for helper-API decisions
The following decisions live only inside router.py:
- Naming convention: `sip{S}.cube{C}.<comp>`,
`sip{S}.{io_id}.<comp>`,
`sip{S}.cube{C}.hbm_ctrl.pe{pe_id}`.
- Adjacency policy: which edge kinds belong to which graph.
- Algorithm for recovering the PE id from an HBM offset and the HBM
slice size.
- Dijkstra weight selection
(`routing_weight_mm or distance_mm`).
Breaking single ownership (e.g., a component starting to build
`f"sip{s}..."` itself) would explode the blast radius of naming-
convention changes. This aligns with ADR-0015 D4.
### D6. Consumers of the helper API
Methods listed in this ADR are called from (current corpus):
- `probes/probe.py` (ADR-0049): `find_pcie_ep`, `find_io_cpu`,
`find_m_cpu`, `find_node_path`, `find_mcpu_dma_path`,
`find_memory_path`, `find_path`, `resolve`.
- `runtime_api/distributed.py` (ADR-0047): indirectly (engine-internal
routing).
- `ccl/install.py` (ADR-0023): `find_all_pcie_eps`, `resolve`.
- `sim_engine/event_log.py`: like probe — `find_pcie_ep`,
`find_memory_path`.
- `components/builtin/m_cpu.py`, `components/builtin/io_cpu.py`:
`find_node_path`, `find_mcpu_dma_path`.
- Tests (test_routing.py, test_cross_sip_routing.py, …): most of
D3.1-D3.5.
When a new consumer arrives, D1/D3 act as a first-pass guide on
whether an existing method matches the intent or a new one is needed.
## Alternatives Considered
### A1. One adjacency graph + per-call edge-kind filtering
Rejected. Re-filtering the graph on every `find_*()` call hurts
Dijkstra cache locality. Constructing four graphs in one pass (D2)
has modest memory cost (edges ≤ a few × 10⁴), and selection happens
in O(1) at call time.
### A2. Drive adjacency separation by separate edge metadata rather than `kind`
Rejected. `edge.kind` is already assigned by the topology builder
(ADR-0015 D4 + ADR-0017); a parallel metadata field would force
synchronization between two systems.
### A3. Use BFS with uniform weights instead of Dijkstra
Rejected. With per-edge `routing_weight_mm` (mesh link / UCIe /
IO-internal), BFS minimizes hop count rather than total
latency/distance. SPEC R1 + R2 require deterministic and accurate
routing, which BFS does not deliver.
### A4. Express the helper API as module functions instead of classes
Rejected. Each class
(`AddressResolver`, `PathRouter`) maintains caches
(`_node_ids`, `_hbm_slice_bytes`, four adjacency graphs) reused across
many routing queries on the same graph. Module functions would have
to rebuild state per call or go global, hurting safety and
performance.
## Consequences
- When components / probe / IPCQ install / runtime API all go through
router.py helpers, a naming-convention change (e.g., `.io0.` →
`.iochiplet0.`) is a one-file edit (D5).
- D2's four-graph split is now ADR-locked, so when a new edge kind is
added (e.g., a new inter-die UCIe-link kind), the right adjacency
category is decided explicitly rather than by default.
- D3.1's same-cube vs cross-cube branching (ADR-0017 D7) is explicit,
so anyone changing routing knows which adjacency to touch.
- D6's consumer list bounds PR-review scope for helper-API changes,
and the backward-compat shims (D3.6) are flagged as deprecation
candidates.
@@ -0,0 +1,371 @@
# ADR-0052: OpLog + MemoryStore Schemas — sim_engine internals
## Status
Accepted (2026-05-22).
Pins down the `OpRecord` schema and the `record_start` / `record_end` /
`record_copy` behavior in `sim_engine/op_log.py`, plus the
(space, addr) namespace and read/write semantics of `MemoryStore` in
`sim_engine/memory_store.py`. ADR-0020 (2-pass data execution) declares
that these two facilities exist, but **the precise record fields and
semantics** had no ADR-level coverage, and several recent ADRs
(ADR-0046 D3.2's `tl.store` visibility, ADR-0023 D9's IPCQ copy
record) depend on these semantics.
## First action
### `OpLogger(memory_store=None)`
On construction, initialize three fields:
1. `self._records: list[OpRecord] = []` — accumulated records.
2. `self._pending: dict[int, dict] = {}` — partial records keyed by
`id(msg)` (created at `record_start`, completed at `record_end`).
3. `self._memory_store = memory_store` — optional MemoryStore
reference. Used to capture math-op input snapshots and dma_write
HBM-source snapshots.
Records and pending are empty; the `record_*` calls accumulate data
over time.
### `MemoryStore()`
On construction, initialize a single field:
`self._storage: dict[str, dict[int, np.ndarray]] = {}` — a two-level
dict (`space → addr → ndarray`). Inner dicts are created lazily as new
spaces appear.
In short, **both facilities' first act is "set up an empty accumulator
buffer plus a sparse, per-space dict"**. The first record / write
fills the fields when it arrives.
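The lazily populated two-level dict can be sketched as below. `SparseStore` is a hypothetical reduction of `MemoryStore`: the method names, the `KeyError` on a missing entry, and the use of plain lists instead of `np.ndarray` payloads are all assumptions. Keeping the caller's object rather than a copy follows the "reference-store semantics" mentioned in the commit message.

```python
class SparseStore:
    """Hypothetical reduced MemoryStore: space -> addr -> payload."""

    def __init__(self):
        self._storage = {}  # starts empty; spaces appear on first write

    def write(self, space, addr, data):
        # The inner dict is created lazily the first time a space is used.
        self._storage.setdefault(space, {})[addr] = data

    def read(self, space, addr):
        # Assumed behavior: a never-written location raises KeyError.
        return self._storage[space][addr]

    def spaces(self):
        return sorted(self._storage)


store = SparseStore()
spaces_before = store.spaces()
payload = [1.0, 2.0]
store.write("hbm", 0x100, payload)
spaces_after = store.spaces()
```

Because the store holds a reference, anything that needs a stable view of the data at a point in time has to take its own copy.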
## Context
ADR-0020 D2/D5/D7 (2-pass data execution) declares:
- During Phase 1 (timing), `ComponentBase._on_process_start/end` hooks
call `OpLogger.record_start/end`, recording the time and metadata of
every data op.
- Phase 2 (data) replays the op log in `t_start` order to compute real
data.
- Data payloads live in `MemoryStore`, keyed by (space, addr).
Subsequent ADRs (ADR-0023 D9's IPCQ atomic write, ADR-0027's Megatron
TP scratch-overwrite avoidance, ADR-0046 D3.2's `tl.store` visibility)
depend on op_log and MemoryStore behavior, but **the exact record
fields / space names / snapshot timing** are only discoverable via
source grep. This ADR codifies them.
## Decision
### D1. `OpRecord` schema — seven fields
```python
@dataclass
class OpRecord:
t_start: float
t_end: float
component_id: str
op_kind: str # "memory" | "gemm" | "math" | "unknown"
op_name: str # e.g. "dma_read", "gemm_f16", "exp",
# "TileToken/DMA_READ", "composite_gemm",
# "ipcq_copy"
params: dict[str, Any]
dependency_ids: list[int] = field(default_factory=list)
```
- **`t_start` / `t_end`**: SimPy time (float ns). `t_start` is when the
component begins the op; `t_end` is completion. Duration =
`t_end - t_start`.
- **`component_id`**: the node id where the op occurred (e.g.,
`"sip0.cube0.pe0.pe_dma"`).
- **`op_kind`**: one of four. Phase 2 DataExecutor branches on this.
- **`op_name`**: a debug/analysis-friendly name. For a TileToken,
expands to `"TileToken/{stage_type}"` (e.g.,
`"TileToken/DMA_READ"`) to disambiguate stages.
- **`params`**: op-specific metadata dict (see D3).
- **`dependency_ids`**: currently unused (default `[]`). Reserved for
future cross-op dependency tracking.
### D2. `OpLogger.records` — guaranteed `t_start` sort
```python
@property
def records(self) -> list[OpRecord]:
self._records.sort(key=lambda r: r.t_start)
return self._records
```
A stable sort by `t_start` runs on each access. Records with the same
`t_start` preserve insertion order. Aligns with ADR-0020 D5's
"t_start stable ordering" requirement.
Phase 2 DataExecutor always accesses via the `records` property, so
even when `record_end` calls arrive out of `t_start` order (e.g., a
short op started later but finished earlier), the sequence handed to
Phase 2 is consistent.
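The out-of-order case can be reproduced with a reduced logger. `MiniOpLogger` and its `record_start` / `record_end` parameters are hypothetical simplifications; only the `id(msg)`-keyed pending dict and the sorting `records` property follow the text above.

```python
from dataclasses import dataclass, field


@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component_id: str
    op_kind: str
    op_name: str
    params: dict = field(default_factory=dict)
    dependency_ids: list = field(default_factory=list)


class MiniOpLogger:
    """Hypothetical reduction of OpLogger without snapshot handling."""

    def __init__(self):
        self._records = []
        self._pending = {}  # partial records keyed by id(msg)

    def record_start(self, msg, component_id, now):
        self._pending[id(msg)] = {"t_start": now, "component_id": component_id}

    def record_end(self, msg, now, op_kind, op_name):
        start = self._pending.pop(id(msg))
        self._records.append(OpRecord(
            t_start=start["t_start"], t_end=now,
            component_id=start["component_id"],
            op_kind=op_kind, op_name=op_name))

    @property
    def records(self):
        # list.sort is stable, so equal t_start keeps insertion order.
        self._records.sort(key=lambda r: r.t_start)
        return self._records


log = MiniOpLogger()
long_msg, short_msg = object(), object()
log.record_start(long_msg, "sip0.cube0.pe0.pe_dma", now=0.0)
log.record_start(short_msg, "sip0.cube0.pe1.pe_dma", now=2.0)
log.record_end(short_msg, now=3.0, op_kind="memory", op_name="short")
log.record_end(long_msg, now=10.0, op_kind="memory", op_name="long")
names = [r.op_name for r in log.records]
```

The short op finishes first and is appended first, yet `records` returns the long op ahead of it because it started earlier.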
### D3. `params` schema per `op_name` (matrix from `_extract_op_info`)
#### D3.1. `op_kind="memory", op_name="dma_read"` (DmaReadCmd)
```python
{"src_addr": int, "nbytes": int, "handle_id": str}
```
#### D3.2. `op_kind="memory", op_name="dma_write"` (DmaWriteCmd)
```python
{
"src_space": str, # handle.space ("tcm"|"hbm"|"sram"), default "tcm"
"src_addr": int, # handle.addr
"shape": tuple, "dtype": str,
"dst_space": "hbm", # DmaWrite always targets HBM
"dst_addr": int,
"nbytes": int,
"handle_id": str,
# When src_space == "hbm" at record_end, a snapshot is added (D4)
"snapshot": np.ndarray | None,
}
```
#### D3.3. `op_kind="gemm", op_name=f"gemm_{dtype_a}"` (GemmCmd)
```python
{
"src_a_addr": int, "src_b_addr": int, "dst_addr": int,
"shape_a": tuple, "shape_b": tuple, "shape_out": tuple,
"dtype_in": str, "dtype_out": str,
"m": int, "k": int, "n": int,
# ADR-0027: per-operand + output spaces preserved
"src_a_space": str, "src_b_space": str, "dst_space": str,
}
```
#### D3.4. `op_kind="math", op_name=msg.op` (MathCmd; op = "exp", "sum", "add", "where", …)
```python
{
"input_addrs": list[int], # addrs of input handles
"input_shapes": list[tuple],
"input_spaces": list[str],
"input_dtypes": list[str],
"dst_addr": int, "dst_space": str,
"shape_out": tuple, "dtype": str,
"axis": int | None, # only meaningful for reductions
# All inputs get snapshots at record_end (D4)
"input_snapshots": list[np.ndarray | None],
}
```
#### D3.5. `op_kind="gemm" or "math", op_name=f"composite_{op}"` (CompositeCmd)
```python
{
"op": str, # "gemm" | "math"
"out_addr": int, "out_nbytes": int,
# If op == "gemm", same fields as GemmCmd are added:
"src_a_addr": int, "src_b_addr": int,
"shape_a": tuple, "shape_b": tuple,
"dtype_in": str, "dtype_out": str,
"src_a_space": str, "src_b_space": str,
"dst_space": "hbm", "dst_addr": int, # = out_addr
}
```
If `op == "gemm"`, `op_kind = "gemm"`; otherwise `"math"`. This aliasing
lets Phase 2 replay composite-gemm on the same path as `GemmCmd`.
#### D3.6. `op_kind="memory", op_name="ipcq_copy"` (record_copy path)
```python
{
"src_space": str, "src_addr": int,
"dst_space": str, "dst_addr": int,
"shape": tuple, "dtype": str, "nbytes": int,
"snapshot": np.ndarray | None, # passed by caller; if None, record_copy reads fresh
}
```
`PE_DMA._handle_ipcq_inbound` (ADR-0023 D9) emits this record so Phase
2 can replay the IPCQ slot's inbound copy. It bypasses
`record_start` / `record_end` and pushes directly via `record_copy()`.
#### D3.7. `op_kind="unknown", op_name=type(msg).__name__`
Fallback for messages `_extract_op_info` doesn't recognize. `params =
{}`. If DataExecutor encounters this kind, it skips — Phase 2 replay
is unaffected.
### D4. Snapshot capture timing
When `OpLogger._memory_store` is set, `record_end` performs:
- **Math op**: read every input
(addr/shape/space/dtype) from `self._memory_store.read(...)` and
attach an ndarray copy to `params["input_snapshots"]`. A failed read
yields `None` for that input.
- **`dma_write` op**: snapshot the source **only if `src_space ==
"hbm"`** and attach to `params["snapshot"]`. TCM (PE scratch)
sources are **deliberately skipped** — TCM is repopulated by Phase 2
math/gemm replay, and a Phase-1-time snapshot would capture a
previous kernel's stale value (ADR-0027 postmortem: TP gemm →
all_reduce race).
- **`ipcq_copy`**: the caller passes the in-flight snapshot via
`snapshot=token.data`. If absent, `record_copy` attempts a fresh
read from MemoryStore.
Snapshots are taken with `.copy()` (fresh allocation), making them
safe against later storage mutation. This is the foundation of
ADR-0027's "cross-PE Phase 2 ordering" race-avoidance.
When `memory_store` is `None` (Phase 1 timing-only mode), all
snapshot steps are skipped. Only the timing portion of the record is
preserved; data replay is unavailable.
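The per-op rules above can be summarized as a small decision function. This is a minimal sketch rather than the actual `OpLogger.record_end`: `snapshot_targets` and its arguments are hypothetical names, and composite ops are not distinguished because this ADR does not say how they are treated.

```python
from typing import Any


def snapshot_targets(op_kind: str, op_name: str, params: dict[str, Any],
                     has_memory_store: bool) -> list[str]:
    """Return the params keys that would receive snapshots at record_end.

    Hypothetical helper that only mirrors the D4 rules.
    """
    if not has_memory_store:
        # Timing-only mode: every snapshot step is skipped.
        return []
    if op_kind == "math":
        # Math ops snapshot every input.
        return ["input_snapshots"]
    if op_name == "dma_write":
        # Only HBM sources are snapshotted; TCM sources are skipped on purpose.
        return ["snapshot"] if params.get("src_space") == "hbm" else []
    # ipcq_copy is handled by record_copy, not by record_end.
    return []
```

The TCM branch returning an empty list is the part that encodes the ADR-0027 race-avoidance decision.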
### D5. TileToken handling — `record_start` captures stage info
ADR-0014 D6's self-routing tile token (pipeline mode) may have already
advanced its `stage_idx` by the time `record_end` runs (the TileToken
caches the next stage's params as it moves to the next component).
Therefore:
`record_start` pre-saves the following in `pending[id(msg)]["snap"]`:
```python
snap["stage_type"] = stage.stage_type.name # "DMA_READ", "GEMM", ...
snap["stage_params"] = dict(stage.params) # copy of params at start time
```
`record_end` retrieves this snap and merges into params:
- Adds `params["stage_type"]` to final params.
- Merges `stage_params` keys (keeps existing values if any).
- If `op_name == "TileToken"`, rewrites it to
`f"TileToken/{stage_type}"` (e.g., `"TileToken/DMA_READ"`),
disambiguating different stages emitted by the same component.
Thanks to this, DMA_READ vs DMA_WRITE, FETCH vs STORE coming from the
same component (e.g., pe_dma) are distinguishable in reports.
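The merge performed by `record_end` can be sketched as follows. `merge_stage_snap` is a hypothetical stand-alone function, not part of `OpLogger`; it assumes `snap` has the two keys shown in the snippet above.

```python
from typing import Any


def merge_stage_snap(op_name: str, params: dict[str, Any],
                     snap: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Merge the stage info saved at record_start into the final params."""
    merged = dict(params)
    merged["stage_type"] = snap["stage_type"]
    for key, value in snap.get("stage_params", {}).items():
        # Existing values win; stage params only fill in missing keys.
        merged.setdefault(key, value)
    if op_name == "TileToken":
        # Disambiguate stages emitted by the same component.
        op_name = f"TileToken/{snap['stage_type']}"
    return op_name, merged
```

For example, a `TileToken` whose params already contain `nbytes` keeps that value even if the saved `stage_params` carries a different one.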
### D6. `MemoryStore` — two-level (space, addr) dict
```python
import numpy as np

class MemoryStore:
    def __init__(self) -> None:
        self._storage: dict[str, dict[int, np.ndarray]] = {}

    def write(self, space, addr, data):
        # Lazily create the space on first write (D6.1); store the reference (D6.3).
        self._storage.setdefault(space, {})[addr] = data

    def read(self, space, addr, shape=None, dtype=None) -> np.ndarray: ...
    def has(self, space, addr) -> bool: ...
    def snapshot(self) -> "MemoryStore": ...
```
#### D6.1. Space namespace
A string key. Standard values:
- `"hbm"`: HBM data (deploy_tensor + Phase 2 dma_write results).
- `"tcm"`: PE-local TCM (Phase 2 math/gemm output).
- `"sram"`: cube-level SRAM (ADR-0023 D9.7's IPCQ slot tier).
Other spaces (e.g., `"reg"`) are allowed — `_storage` is a lazy dict
that creates a new space when `write` first touches it.
#### D6.2. Address keying
`addr` is an integer. It may be a **physical address (PA) or a virtual
address (VA)** — `MemoryStore` itself doesn't know address-space
semantics; it just uses them as keys. Phase 1's `MemoryWriteMsg`
writes both PA and VA
(`_create_tensor` zero-inits at PA and at the VA base too); Phase 2
reads/writes via the addresses captured by op_log.
The caller decides `addr`'s meaning — `MemoryStore` provides only
lookup.
#### D6.3. read/write semantics — reference store (no copy)
`write(space, addr, data)`: stores the ndarray reference. **No copy.**
If the caller later mutates the same ndarray, the stored value
changes.
`read(space, addr, shape=None, dtype=None)`: returns the stored
ndarray reference. If `shape`/`dtype` are provided:
- `dtype != stored.dtype`: `arr.view(np_dtype)` reinterprets as a
view (no copy).
- `shape != stored.shape`: if `nbytes` matches, `arr.reshape(shape)`
is a view.
- `nbytes` mismatch → `ValueError`.
To detach the data, the caller must call `arr.copy()`. ADR-0027's
race-avoidance requires explicit `.copy()` in op_log snapshot steps
for exactly this reason.
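The aliasing behavior described here is plain NumPy semantics and can be checked in isolation. The sketch below uses a bare nested dict in place of `MemoryStore`; the addresses and values are illustrative only.

```python
import numpy as np

storage: dict[str, dict[int, np.ndarray]] = {}

data = np.arange(4, dtype=np.float32)
# Reference store: the dict holds the caller's array, not a copy.
storage.setdefault("tcm", {})[0x100] = data

stored = storage["tcm"][0x100]
snap = stored.copy()               # detached copy, as op_log snapshots require
as_u32 = stored.view(np.uint32)    # dtype reinterpretation, still a view
as_2x2 = stored.reshape(2, 2)      # same nbytes, still a view

data[0] = 7.0                      # the caller mutates its original array

aliased = bool(stored[0] == 7.0 and as_2x2[0, 0] == 7.0)
detached = bool(snap[0] == 0.0)
shares = np.shares_memory(stored, as_u32)
```

Only `snap` keeps the pre-mutation value, which is why the snapshot steps in D4 call `.copy()` explicitly.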
#### D6.4. `has(space, addr) -> bool`
Existence check; does not materialize data.
#### D6.5. `snapshot() -> MemoryStore`
Shallow copy. Creates a new instance of inner dicts but shares
ndarray references. Used at Phase 2 init to fork from Phase 1's
store, so Phase 2 mutations don't affect Phase 1's remaining
consumers.
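A shallow fork of this kind can be sketched with a dict comprehension. `fork_storage` is a hypothetical stand-in for `snapshot()` operating on the raw nested dict; it is not the actual method body.

```python
import numpy as np


def fork_storage(
    storage: dict[str, dict[int, np.ndarray]],
) -> dict[str, dict[int, np.ndarray]]:
    """Build new inner dicts that point at the same ndarray objects."""
    return {space: dict(by_addr) for space, by_addr in storage.items()}


phase1 = {"hbm": {0: np.zeros(2)}}
phase2 = fork_storage(phase1)

# Rebinding a key or adding a space in the fork leaves the original untouched.
phase2["hbm"][0] = np.ones(2)
phase2["tcm"] = {16: np.ones(1)}
```

Note the limit of a shallow copy: an in-place write into a shared ndarray (for example `arr[:] = ...`) would still be visible through both stores, so isolation holds only for writes that rebind an address to a new array.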
### D7. op_log assumes a single-threaded SimPy
`OpLogger`'s `_records` and `_pending` are lock-free. SimPy is
single-threaded, so nothing else can intrude between `record_start`
and `record_end` for the same message.
When multi-process kernbench (ADR-0047 D6) arrives, OpLogger must be
split per process — one OpLogger instance cannot receive records from
multiple processes.
## Alternatives Considered
### A1. Externalize op_log to SQLite / parquet
Rejected (currently). The in-memory list minimizes Phase 1 → Phase 2
hand-off latency. Externalization makes sense for long-running batch
runs but adds overhead for the current single-run workload.
### A2. Capture snapshots at `record_start`
Rejected. At `record_start`, inputs are often not yet populated (e.g.,
a math op's input is the output of a just-issued previous op).
`record_end` is the correct point.
### A3. Per-component MemoryStore
Rejected. The (space, addr) key already disambiguates effectively, and
splitting per component would complicate cross-PE IPCQ copy (ADR-0023
D9), which needs access to both source and destination stores.
### A4. Explicit dependency edges in op_log
Partially adopted. The `dependency_ids` field exists on `OpRecord` but
is currently unused (D1). Phase 2 DataExecutor orders via `t_start` +
a secondary sort (memory ops before math at the same `t_start`). When
an explicit dependency graph is required, this field is the home.
Current ordering rules are sufficient, so it remains unused.
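The ordering rule can be expressed as a two-part sort key. This is a sketch under assumptions: `Rec` is a hypothetical stand-in carrying only the fields needed here, and placing every non-memory op (including gemm) after memory ops at the same `t_start` is an assumption, since the text only states "memory ops before math".

```python
from dataclasses import dataclass


@dataclass
class Rec:
    # Hypothetical subset of OpRecord used for ordering only.
    op_kind: str
    op_name: str
    t_start: float


def replay_order(records: list[Rec]) -> list[Rec]:
    """Sort by t_start; at equal t_start, memory ops come first."""
    return sorted(
        records,
        key=lambda r: (r.t_start, 0 if r.op_kind == "memory" else 1),
    )
```

Because `sorted` is stable, records that tie on both key parts keep their original log order.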
## Consequences
- ADR-0020's op_log / MemoryStore declarations are expanded into the
concrete D1–D6 schemas, so writing/modifying Phase 2 DataExecutor
doesn't need source-grep to learn field semantics.
- D3's per-`op_name` params matrix makes adding new ops (e.g., a new
reduction type) a question of branching in `_extract_op_info`.
- D4's per-op snapshot policy (math = input snapshot, dma_write =
HBM-only snapshot) is ADR-locked, so ADR-0027's race-avoidance
decision won't silently regress on future refactors.
- D6.3's reference-store semantics are explicit, putting mutation
safety on the caller. ADR-0027's explicit `.copy()` pattern is
justified.
- D7's single-thread assumption is recorded, so multi-process
kernbench (ADR-0047 D6's supersession candidate) will need OpLogger
separation when introduced.
@@ -0,0 +1,351 @@
# ADR-0053: Topology Builder + Visualizer Algorithms
## Status
Accepted (2026-05-22).
Pins down the key algorithmic choices of the topology compile and
visualization pipeline jointly implemented by `topology/builder.py`,
`topology/mesh_gen.py`, and `topology/visualizer.py`:
placement-driven router attachment, mesh auto-layout, the source_hash
cache, view projections, and SVG rendering. ADR-0006 defines the
high-level intent of topology compilation (compiled topology, distance
extraction, automatic diagram generation), but **which algorithms the
builder actually uses** was only discoverable via source grep.
## First action
When `resolve_topology(path_str)` is called, four steps run in order:
1. **Path validation** (`builder.py::resolve_topology`):
`Path(path_str).expanduser().resolve()`, existence check, file
check. Failure → `FileNotFoundError` or `ValueError`.
2. **YAML parsing** (`_read_spec`): `yaml.safe_load`. Parse errors
yield a `ValueError` with line/column. Non-dict roots are
rejected.
3. **Auto-generate the mesh** (`mesh_gen.ensure_mesh_file`): create or
reuse a `cube_mesh.yaml` next to the topology file. Cache hit on
matching source_hash; miss triggers regeneration. This step decides
the cube NoC's router grid and attachment information.
4. **Compile the graph** (`_compile_graph`): system → IO chiplets →
cubes → inter-cube edges → IO↔cube edges → system↔IO edges, then
build four view projections (system, sip, cube, pe) and wrap into
a `TopologyGraph`.
In short, **topology compilation's first act is "read topology.yaml as
a dict, create/validate cube_mesh.yaml in the same directory, then
build the flat graph + 4-view projection in system → sip → cube → pe
order"**.
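Step 1 can be sketched with `pathlib` alone. `validate_topology_path` is a hypothetical name, and the mapping of each failure to `FileNotFoundError` versus `ValueError` is an assumption; the text above only says that one of the two is raised.

```python
from pathlib import Path


def validate_topology_path(path_str: str) -> Path:
    """Normalize the path, then check that it exists and is a regular file."""
    path = Path(path_str).expanduser().resolve()
    if not path.exists():
        # Assumed mapping: a missing path raises FileNotFoundError.
        raise FileNotFoundError(f"topology file not found: {path}")
    if not path.is_file():
        # Assumed mapping: a directory or other non-file raises ValueError.
        raise ValueError(f"topology path is not a file: {path}")
    return path
```

Resolving before the checks means the error message always shows the absolute path that was actually tested.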
## Context
`topology/` package responsibilities:
- **builder.py** (1207 lines): turns topology.yaml into a
`TopologyGraph` (nodes + edges + 4 view projections).
- **mesh_gen.py** (305 lines): auto-decides the cube NoC's router
grid and PE/UCIe/M_CPU/SRAM attachment positions and caches them in
`cube_mesh.yaml`.
- **visualizer.py** (887 lines): generates four SVG diagrams (system /
sip / cube / pe) from a `TopologyGraph`.
ADR-0006 makes the high-level decision that "the result of topology
compilation is the single source for distance metadata and diagram
generation", but specific algorithms (e.g., placement-driven nearest-
router attachment, the HBM exclusion zone, which fields in source_hash
trigger regeneration) are not in any ADR.
In particular, these decisions are absent at ADR level:
- Why is mesh_gen cached in a separate file (`cube_mesh.yaml`)?
- Which fields are in source_hash, and which changes force
regeneration?
- Why placement coordinates in mm rather than cube coordinates?
- How are the HBM exclusion zone and UCIe N/S/E/W distribution
decided inside the mesh?
- What is the abstraction-level difference among the four view
projections (system/sip/cube/pe)?
This ADR captures these decisions in one place.
## Decision
### D1. Compile pipeline — six stages
`_compile_graph(spec)`:
1. **System nodes** (`_instantiate_system`): add system-level nodes
like `fabric.switch0` and the host CPU.
2. **Per-SIP loop** (`for sip_id in range(system.sips.count)`):
- **IO chiplets** (`_instantiate_io_chiplets`): create pcie_ep /
io_cpu / io_noc / io_ucie PHYs / conn nodes and their bidirectional
internal edges.
- **Cube instantiation** (`_instantiate_cube`): using
cube_mesh.yaml's router grid, instantiate cube routers, PE
sub-components (pe_cpu, pe_dma, pe_fetch_store, pe_gemm, pe_math,
pe_mmu, pe_tcm, pe_scheduler, pe_ipcq), m_cpu, sram, hbm_ctrl,
and their internal edges.
- **Inter-cube edges** (`_add_inter_cube_edges`): the UCIe
N/S/E/W mesh edges.
- **IO ↔ cube edges** (`_add_io_to_cube_edges`): connect io_noc to
each cube's edge UCIe phy.
3. **Switch ↔ IO edges** (`_add_system_to_io_edges`): bidirectional
edges between `fabric.switch0` and each SIP's `pcie_ep` (the
cross-SIP IPCQ path of ADR-0038 D3 + ADR-0010).
4. **Build four view projections**:
- `_build_system_view(spec)` — Tray level: SIPs and the system
switch.
- `_build_sip_view(spec)` — inside one SIP: cube mesh + IO
chiplet.
- `_build_cube_view(spec)` — inside one cube: router grid + PE /
M_CPU / SRAM / HBM_CTRL attachments.
- `_build_pe_view(spec)` — inside one PE: nine sub-components +
internal edges (pe_internal kind).
5. **Return `TopologyGraph`**: `TopologyGraph(spec, nodes, edges,
system_view, sip_view, cube_view, pe_view)`.
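The six build stages (system nodes, IO chiplets, cubes, inter-cube
edges, IO ↔ cube edges, switch ↔ IO edges) are **ordered for a
reason**: inter-cube edges have valid src/dst only after cubes exist,
and IO chiplets must precede the IO ↔ cube edges that reference them.
View projection and the `TopologyGraph` return follow once the flat
graph is complete. New node types must slot into the matching stage.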
### D2. `cube_mesh.yaml` — a separate file with a source_hash cache
`mesh_gen.ensure_mesh_file(cube_spec, mesh_path)`:
1. Compute `source_hash = _compute_source_hash(cube_spec)` from these
input fields:
- `geometry` (cube_mm.w/h …).
- `pe_layout` (corners, pe_per_corner).
- `ucie.n_connections`.
- `memory_map.hbm_mapping_mode`.
- `placement` (m_cpu/sram pos_mm).
2. If `mesh_path` (= `cube_mesh.yaml` next to topology.yaml) exists
and `existing.source_hash == source_hash`, reuse it (cache hit).
3. Otherwise, generate a new mesh via
`_generate_mesh(cube_spec, source_hash)` and write to yaml.
The mesh is cached as a separate file because:
- Mesh generation involves nontrivial PE/UCIe/router attachment math
and is too expensive to redo every time.
- Multiple runs with the same cube spec must guarantee an identical
mesh.
- The resulting mesh is itself an inspectable / debuggable artifact.
The five fields listed in source_hash are the ones that determine
mesh shape; other changes (e.g., bandwidth, overhead_ns) do not
trigger mesh regeneration.
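The invalidation rule can be illustrated with a hash over only the five listed inputs. This is a sketch, not the real `_compute_source_hash`: the use of SHA-256, canonical JSON serialization, and the function name `compute_source_hash` are all assumptions.

```python
import hashlib
import json
from typing import Any


def compute_source_hash(cube_spec: dict[str, Any]) -> str:
    """Hash only the mesh-shaping inputs listed in D2."""
    relevant = {
        "geometry": cube_spec.get("geometry"),
        "pe_layout": cube_spec.get("pe_layout"),
        "ucie.n_connections": cube_spec.get("ucie", {}).get("n_connections"),
        "memory_map.hbm_mapping_mode": cube_spec.get("memory_map", {}).get(
            "hbm_mapping_mode"
        ),
        "placement": cube_spec.get("placement"),
    }
    # sort_keys makes the digest independent of dict insertion order.
    canonical = json.dumps(relevant, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With this shape, changing a bandwidth field leaves the digest unchanged, while changing `ucie.n_connections` produces a new digest and therefore a cache miss.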
### D3. Cube NoC mesh auto-layout
`_generate_mesh(cube_spec, source_hash)`:
#### D3.1. Rows / columns
- `pe_positions = _corner_pe_positions(cube_w, cube_h)`: PE-center
coordinates (mm) per corner (NW/NE/SW/SE). Hardcoded patterns like
`(1.5, 1.5)` and `(cube_w-1.5, cube_h-1.5)`; with `pe_per_corner=2`,
each corner has two PE positions.
- `col_xs = _compute_col_positions(...)`: union of PE x-coordinates,
plus relay columns inserted when any gap exceeds
`max_spacing = 3.0 mm`.
- `row_ys, rows_per_half = _compute_row_positions(cube_h,
n_connections, pe_positions)`:
- `n_conn = max(n_connections, 2)` (hot-path minimum).
- `rows_per_half = ceil(n_conn / 2)`.
- Top half + two HBM rows + bottom half. HBM sits at
`(cube_h/2 - 1.5, cube_h/2 + 1.5)`. The gap between PE rows and
HBM rows is `hbm_gap = 1.5 mm`.
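The column and row counting rules can be sketched as below. The function names are hypothetical, and spacing relay columns evenly inside a wide gap is an assumption: the text only says relays are inserted when a gap exceeds `max_spacing`.

```python
import math


def col_positions(pe_xs: list[float], max_spacing: float = 3.0) -> list[float]:
    """Unique PE x-coordinates plus relay columns inside gaps wider than max_spacing."""
    xs = sorted(set(pe_xs))
    out = [xs[0]]
    for left, right in zip(xs, xs[1:]):
        gap = right - left
        # Number of segments needed so that each one is <= max_spacing.
        n_seg = math.ceil(gap / max_spacing)
        for i in range(1, n_seg):
            out.append(left + gap * i / n_seg)  # assumed: evenly spaced relays
        out.append(right)
    return out


def rows_per_half(n_connections: int) -> int:
    """PE rows in each half of the cube, with a floor of two connections."""
    n_conn = max(n_connections, 2)
    return math.ceil(n_conn / 2)
```

For PE columns at 1.5 mm and 10.5 mm, the 9 mm gap is split into three segments, giving relay columns at 4.5 mm and 7.5 mm.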
#### D3.2. HBM exclusion zone
`hbm_row_start = rows_per_half`,
`hbm_row_end = rows_per_half + 1`.
`hbm_col_start = n_cols // 2 - 1`,
`hbm_col_end = n_cols // 2`.
Router slots inside this (row, col) rectangle are marked `None` (no
router). HBM controllers are added separately as
`hbm_ctrl.pe{X}` nodes following ADR-0017 D9's per-PE partition
pattern.
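A sketch of how the exclusion rectangle empties router slots follows. It assumes the start/end indices are inclusive and that the grid has `2 * rows_per_half + 2` rows (top half, two HBM rows, bottom half); `router_grid` is a hypothetical helper, and the slot labels reuse the `noc.r{R}c{C}` naming from D4.

```python
from typing import Optional


def router_grid(n_cols: int, rows_per_half: int) -> list[list[Optional[str]]]:
    """Build a row-major router grid with None inside the HBM exclusion zone."""
    hbm_row_start, hbm_row_end = rows_per_half, rows_per_half + 1
    hbm_col_start, hbm_col_end = n_cols // 2 - 1, n_cols // 2
    n_rows = 2 * rows_per_half + 2  # top half + two HBM rows + bottom half
    grid: list[list[Optional[str]]] = []
    for r in range(n_rows):
        row: list[Optional[str]] = []
        for c in range(n_cols):
            in_hbm = (hbm_row_start <= r <= hbm_row_end
                      and hbm_col_start <= c <= hbm_col_end)  # assumed inclusive
            row.append(None if in_hbm else f"noc.r{r}c{c}")
        grid.append(row)
    return grid
```

With `n_cols=4` and `rows_per_half=1`, the four central slots (rows 1 and 2, columns 1 and 2) are `None`, and the remaining twelve slots hold routers.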
#### D3.3. PE attachment
Each corner's PEs map to a row:
- Top half: NW → row 0, NE → row 1 (top_corners index).
- Bottom half: SW → row `hbm_row_end + 1`, SE → row
`hbm_row_end + 2`.
Each PE's x-coordinate attaches to the nearest column's router
(`min(range(n_cols), key=lambda c: abs(col_xs[c] - pe_x))`).
Attachment items are `pe{pe_idx}.dma`, `pe{pe_idx}.cpu`,
`pe{pe_idx}.hbm` (pushed into the router's attach list).
#### D3.4. M_CPU / SRAM attachment — nearest router by Euclidean distance
For `placement.m_cpu.pos_mm` (default `[1.5, 5.5]`) and
`placement.sram.pos_mm` (default `[1.5, 8.5]`), find the router with
the smallest Euclidean distance and append `"m_cpu"` / `"sram"` to
its attach list.
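The nearest-router search can be sketched as a scan over the grid. `nearest_router` is a hypothetical function; it assumes `pos_mm` is ordered `[x, y]`, that excluded slots are `None`, and that ties go to the first router in row-major order, none of which is stated in this ADR.

```python
import math
from typing import Optional, Sequence


def nearest_router(pos_mm: Sequence[float], col_xs: Sequence[float],
                   row_ys: Sequence[float],
                   grid: Sequence[Sequence[Optional[str]]]) -> tuple[int, int]:
    """Return (row, col) of the closest existing router to pos_mm = [x, y]."""
    best: Optional[tuple[int, int]] = None
    best_dist = math.inf
    for r, y in enumerate(row_ys):
        for c, x in enumerate(col_xs):
            if grid[r][c] is None:
                continue  # no router in this slot (e.g. HBM exclusion zone)
            dist = math.hypot(x - pos_mm[0], y - pos_mm[1])
            if dist < best_dist:
                best, best_dist = (r, c), dist
    if best is None:
        raise ValueError("grid contains no routers")
    return best
```

The caller would then append `"m_cpu"` or `"sram"` to the attach list of the router at the returned slot.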
#### D3.5. UCIe N/S/E/W distribution
`ucie_pe_rows = top_pe_rows + bot_pe_rows` (total
`2 * rows_per_half`).
- UCIe-E: one PE row at a time, attach `ucie_e.c{i}` to the rightmost
column's router.
- UCIe-W: attach `ucie_w.c{i}` to the leftmost column's router (E's
mirror).
- UCIe-N/S: split PE columns into left and right halves; attach to
the top row's / bottom row's matching columns.
Each UCIe connection is suffixed `c{i}`, distributing the
`ucie.n_connections` PHYs (ADR-0017 D5+).
### D4. Node naming convention — single ownership
builder.py creates nodes with the following naming convention (the
single-owner principle from ADR-0051 D5):
- `fabric.switch0` — system-level switch.
- `sip{S}.{io_id}.{pcie_ep|io_cpu|io_noc|io_ucie.{dir}|conn.{id}}` —
IO chiplet.
- `sip{S}.cube{C}.{m_cpu|sram|hbm_ctrl.pe{X}|noc.r{R}c{C}|...}` —
inside cube.
- `sip{S}.cube{C}.pe{P}.{pe_cpu|pe_dma|pe_fetch_store|pe_gemm|pe_math|pe_mmu|pe_tcm|pe_scheduler|pe_ipcq}` —
PE sub-components.
Changing this convention requires updating both builder.py and
router.py's helpers (ADR-0051). Components never know the convention
directly — they only call the helpers.
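As an illustration of the convention, the two formatters below produce names in the listed shapes. They are hypothetical helpers written for this example, not the actual functions in builder.py or router.py.

```python
PE_SUBCOMPONENTS = (
    "pe_cpu", "pe_dma", "pe_fetch_store", "pe_gemm", "pe_math",
    "pe_mmu", "pe_tcm", "pe_scheduler", "pe_ipcq",
)


def pe_node(sip: int, cube: int, pe: int, sub: str) -> str:
    """Name of a PE sub-component node."""
    if sub not in PE_SUBCOMPONENTS:
        raise ValueError(f"unknown PE sub-component: {sub}")
    return f"sip{sip}.cube{cube}.pe{pe}.{sub}"


def router_node(sip: int, cube: int, row: int, col: int) -> str:
    """Name of a cube NoC router node."""
    return f"sip{sip}.cube{cube}.noc.r{row}c{col}"
```

Keeping string formatting in one place like this is what lets a naming change stay confined to the builder and the routing helpers.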
### D5. Edge `kind` classification
Every edge gets a `kind`; routing policy (ADR-0051 D2) reads it. Major
kinds:
- `"pe_internal"` — within a PE between sub-components.
- `"pe_to_router"` — PE_DMA ↔ cube NoC router.
- `"router_mesh"` — between cube NoC routers.
- `"router_to_hbm"`, `"router_to_mcpu"`, `"router_to_sram"`,
`"sram_to_router"`, etc. — between cube-attached components.
- `"ucie_internal"`, `"ucie_conn_to_router"`,
`"router_to_ucie_conn"`, `"ucie_conn_to_noc"`,
`"noc_to_ucie_conn"`, `"ucie_mesh"` — UCIe-related.
- `"io_internal"` — inside IO chiplet.
- `"io_to_cube"`, `"cube_to_io"` — at the IO ↔ cube boundary.
- `"pcie"` — switch ↔ pcie_ep.
- `"command"` — control-plane edges only (e.g., M_CPU ↔ NOC; excluded
from PE DMA paths).
Adding a new edge kind requires picking a category in router.py's
four adjacency graphs (ADR-0051 D2). If you forget, it defaults to
`_adj_all` only, which can produce unintended routes.
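The effect of edge kinds on routing can be sketched with a filtered adjacency builder. This is illustrative only: `build_adjacency` is a hypothetical function, the node names are shortened, and the single excluded kind (`"command"`, which the list above says is kept off PE DMA paths) stands in for the real exclusion sets defined in ADR-0051 D2.

```python
from collections import defaultdict

Edge = tuple[str, str, str]  # (src, dst, kind)


def build_adjacency(edges: list[Edge],
                    excluded_kinds: frozenset[str] = frozenset()) -> dict[str, list[str]]:
    """Directed adjacency list that skips edges whose kind is excluded."""
    adj: dict[str, list[str]] = defaultdict(list)
    for src, dst, kind in edges:
        if kind not in excluded_kinds:
            adj[src].append(dst)
    return dict(adj)


edges: list[Edge] = [
    ("pe_dma", "r0", "pe_to_router"),
    ("m_cpu", "r0", "command"),
    ("r0", "r1", "router_mesh"),
]
adj_all = build_adjacency(edges)                          # every edge kind
adj_data = build_adjacency(edges, frozenset({"command"}))  # data paths only
```

An edge kind that appears in no exclusion set is present in every graph built this way, which is the kind of unintended route the paragraph above warns about.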
### D6. View projection — four abstraction levels
`TopologyGraph` keeps four view projections alongside the flat
nodes+edges:
- **system_view** (`_build_system_view`): Tray level. SIP blocks and
`fabric.switch0`. PCIe links shown. For external high-level
overview.
- **sip_view** (`_build_sip_view`): inside one SIP — cube mesh + IO
chiplet (pcie_ep + io_cpu + io_noc). UCIe N/S/E/W appear as
cube-cube links.
- **cube_view** (`_build_cube_view`): inside one cube — router grid +
PE / M_CPU / SRAM / HBM_CTRL attachments + UCIe PHY edges. For
intra-cube routing / placement debugging.
- **pe_view** (`_build_pe_view`): inside one PE — nine sub-components
+ internal edges (pe_internal kind). For detailed PE-internal
dataflow review.
Views are selectively rendered via the spec's
`visualization.emit_views: [system, sip, cube]` (ADR-0006). The pe
view is omitted from default output but the code is retained for
detailed debugging.
### D7. visualizer.py — SVG diagram output
`emit_diagrams(graph, out_dir)` renders every view as SVG. Key
functions:
- `_render_view_svg(view)` — generic view render (no router grid).
- `_render_cube_view_svg(view, spec)` — cube-view specific (HBM block,
router grid layout, PE/M_CPU/SRAM/HBM placement).
- `_draw_node`, `_draw_edge` — node/edge visual representation.
- `_pick_scale`, `_compute_node_sizes` — auto-scaling.
The visualizer is a **derived artifact** (ADR-0006); changes here are
not subject to production checks. This aligns with CLAUDE.md's
"Derived Artifacts" guidance.
### D8. Blast radius of spec changes
| spec field | effect | mesh regenerated? |
|---------------------------------------|---------------------|-------------------|
| `system.sips.count` | SIP count, node count | No |
| `sip.cube_mesh.w/h` | cube mesh shape | No |
| `cube.geometry.cube_mm.w/h` | cube size (mm) | **Yes** |
| `cube.pe_layout.corners/pe_per_corner`| PE attachment positions | **Yes** |
| `cube.ucie.n_connections` | UCIe PHY distribution | **Yes** |
| `cube.memory_map.hbm_mapping_mode` | HBM distribution mode | **Yes** |
| `cube.placement` | M_CPU/SRAM positions | **Yes** |
| `cube.memory_map.*` (besides above) | HBM capacity / BW | No |
| `*.links.*.bw_gbs` | edge bandwidth | No |
| `*.attrs.overhead_ns` | component latency | No |
The table mirrors D2's `_compute_source_hash` inputs. Changes that
require mesh regeneration automatically invalidate `cube_mesh.yaml`'s
source_hash.
## Alternatives Considered
### A1. Regenerate the mesh on every compile without a cache file
Rejected. The cost of mesh generation would be paid repeatedly (CLI
runs, probe, tests) for the same spec, and the human-inspectable
artifact would disappear.
### A2. Merge mesh generation into builder.py
Rejected (currently). It is a 305-line algorithm of its own, and the
mesh-layout decisions (placement-driven router attachment, HBM
exclusion zone) are different from builder's general node/edge
emission. Keeping it separate respects single-responsibility.
### A3. Express placement coordinates in cube coordinates (col/row)
Rejected. mm coordinates flow consistently between the visualizer and
mesh layout (for nearest-router computation). Cube coordinates are
undefined until the router grid is fixed, so they are unsuitable as
placement input.
### A4. Lazy view projection generation
Rejected (currently). The four views are cheap to build (typically <
100 ms), and eager construction guarantees `TopologyGraph` as the
single source of truth.
### A5. Visualizer output in formats besides SVG (PNG/PDF)
Rejected. SVG is vector + text-searchable + directly renderable in
browsers. PNG conversion, when required, is downstream
post-processing (e.g., rsvg-convert).
## Consequences
- ADR-0006's high-level intent is fleshed out via D1–D7; topology
changes can be assessed quickly via D8's table.
- D3's mesh-layout algorithm is ADR-locked, so future PE attachment
patterns (e.g., a 6-zone HBM split) make clear which stage they
affect.
- D5's edge-kind list and D6's view structure are explicit, giving PR
reviewers a quick map of where (builder + router + visualizer) a
new component type ripples through.
- D2's source_hash invalidation rules are explicit, so a stale
`cube_mesh.yaml` (e.g., when only bandwidth changed) is recognized
as correct behavior.