# ADR-0017: Cube NOC 2D Mesh Architecture

## Status

Accepted

## Context

ADR-0003 D3 defines the cube-level NOC as a "distributed on-die fabric" but does not specify the internal routing model, contention semantics, or attachment topology. The implementation uses a 2D mesh router grid with XY routing and per-segment contention modeling. This ADR formalizes that architecture.

## Decision

### D1. NOC node and router grid

Each cube contains a single NOC topology node (`sip{S}.cube{C}.noc`) implemented as `noc_2d_mesh_v1`. Internally, the NOC models a 2D router grid generated by `mesh_gen.py`.

Grid properties:

- Default dimensions: 6x6 routers (derived from PE layout + UCIe connections)
- Router naming: `r{row}c{col}` (e.g., `r0c0`, `r5c5`)
- HBM exclusion zone: center rows/columns are excluded where HBM physically occupies space (e.g., `r2c2`, `r2c3`, `r3c2`, `r3c3`)
- Router positions are derived from physical PE corner placement and cube geometry

The NOC `overhead_ns` is 0.0. Latency is modeled by Manhattan distance traversal within the mesh (distance_mm x ns_per_mm).

### D2. XY routing algorithm

The NOC uses deterministic XY routing:

1. Horizontal segment: route from source X to destination X at source Y
2. Vertical segment: route from destination X at source Y to destination Y

Each directed segment is identified by a unique link key:

- Horizontal: `("H", y_band, x_min, x_max, direction)`
- Vertical: `("V", x_band, y_min, y_max, direction)`

Grid positions are snapped to the router grid, excluding the HBM zone.

### D3. Contention model

Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions sharing a segment (same row or column band, same direction) contend for the resource. This models link-level serialization in a wormhole-routed mesh.

With no contention, NOC traversal latency equals the Manhattan distance multiplied by `ns_per_mm`. Under contention, additional queueing delay is added by SimPy's resource scheduling.

### D4. NOC attachment points

The NOC connects to all major cube-level components:

```text
                  UCIe-N (conn x4)
                          |
            +---------+---+---+---------+
            |         |       |         |
PE0.dma  ---+  r0c0   |  ...  |  r0c5   +--- PE2.dma
PE0.cpu  <--+         |       |         +--< PE2.cpu
            |         |       |         |
UCIe-W  ----+   ...   | [HBM] |   ...   +---- UCIe-E
(conn x4)   |         | zone  |         |     (conn x4)
            |  r2c0   |       |         |
M_CPU  <--->+         |       |         |
            |  r3c0   |       |         |
SRAM  <---->+         |       |         |
            |         |       |         |
PE4.dma  ---+  r4c0   |  ...  |  r4c5   +--- PE6.dma
PE4.cpu  <--+         |       |         +--< PE6.cpu
            |         |       |         |
            +---------+---+---+---------+
                          |
                  UCIe-S (conn x4)

xbar_top attached to: r0c0, r0c1, r1c4, r1c5  (top-half PE routers)
xbar_bot attached to: r4c0, r4c1, r5c4, r5c5  (bottom-half PE routers)
```

### D5. NOC edge bandwidths and distances

| Connection | BW (GB/s) | Distance | Notes |
| --- | --- | --- | --- |
| PE_DMA -> NOC | 256.0 | Physical (PE pos) | Matches HBM slice BW |
| NOC -> PE_CPU | - | 0.0 mm | Command path only |
| NOC <-> xbar_top | 256.0 | 0.0 mm | Per xbar half |
| NOC <-> xbar_bot | 256.0 | 0.0 mm | Per xbar half |
| NOC <-> M_CPU | - | 0.0 mm | Command path |
| NOC <-> SRAM | 128.0 x4 | 0.0 mm | 512 GB/s aggregate |
| NOC <-> UCIe conn | 128.0 | 0.0 mm | Per connection, 4 per port |

Distance 0.0 mm for most connections reflects the distributed nature of the NOC; the actual traversal distance is computed internally via Manhattan distance within the router grid.

### D6. UCIe decomposition and inter-cube traffic

Each cube has 4 UCIe ports (N, S, E, W). Each port is decomposed into:

- 1 `ucie-{PORT}` node: UCIe protocol endpoint (overhead = 8.0 ns)
- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe

This decomposition enables N=4 independent NOC-to-UCIe connections per port, each with 128 GB/s bandwidth. Total aggregate per port: 512 GB/s.

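The port decomposition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `UciePort` class and the `decompose_port` helper are hypothetical names, not part of `mesh_gen.py`, and the node names are shown without any `sip{S}.cube{C}` prefix.

```python
from dataclasses import dataclass

PORTS = ("N", "S", "E", "W")  # the four UCIe ports of a cube


@dataclass(frozen=True)
class UciePort:
    """Hypothetical container for one decomposed UCIe port."""
    name: str                    # protocol endpoint node, e.g. "ucie-N"
    conns: tuple                 # connection bridge nodes, e.g. "ucie-N.conn0"
    overhead_ns: float = 8.0     # endpoint overhead stated in D6
    conn_bw_gbps: float = 128.0  # per-connection bandwidth stated in D6

    @property
    def aggregate_bw_gbps(self) -> float:
        # Connections are independent, so their bandwidths add up.
        return self.conn_bw_gbps * len(self.conns)


def decompose_port(port: str, n_connections: int = 4) -> UciePort:
    """Build the endpoint and connection node names for one port."""
    name = f"ucie-{port}"
    conns = tuple(f"{name}.conn{i}" for i in range(n_connections))
    return UciePort(name=name, conns=conns)


cube_ports = {port: decompose_port(port) for port in PORTS}
```

With the default of four connections, `cube_ports["N"].aggregate_bw_gbps` evaluates to 512.0, matching the per-port aggregate given above.
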
Inter-cube traffic path:

```text
Source: PE_DMA -> NOC -> conn{i} -> ucie-{PORT}
        [UCIe link: 512 GB/s, 1.0mm seam distance]
Target: ucie-{PORT} -> conn{i} -> NOC -> xbar -> HBM
```

UCIe overhead (8.0 ns) is applied at each `ucie-{PORT}` node, so a full crossing incurs 16 ns (TX port + RX port).

### D7. Data paths through the NOC

**PE DMA to local HBM (same half):**

```text
PE_DMA -> NOC -> xbar_top -> HBM_CTRL.slice{0-3}
```

**PE DMA to cross-half HBM:**

```text
PE_DMA -> NOC -> xbar_top -> bridge -> xbar_bot -> HBM_CTRL.slice{4-7}
```

**PE DMA to remote cube HBM:**

```text
PE_DMA -> NOC -> conn -> ucie-E -> [seam] -> ucie-W -> conn -> NOC -> xbar -> HBM
```

**Kernel Launch command to PE:**

```text
[from io_noc] -> ucie -> conn -> NOC -> M_CPU -> NOC -> PE_CPU
```

**Shared SRAM access:**

```text
PE_DMA -> NOC -> SRAM
```

### D8. Mesh generation

The router grid is generated by `mesh_gen.py` based on:

- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner
- `cube.geometry`: cube physical dimensions and HBM zone
- `cube.ucie.n_connections`: determines router count for UCIe attachment

The generator produces a `mesh_data` dictionary containing:

- Router grid with positions and HBM exclusion zones
- PE-to-router attachments (pe_dma, pe_cpu per PE)
- UCIe-to-router attachments (N/S/E/W, distributed across edge routers)
- M_CPU and SRAM router attachments
- xbar_top/bot router assignments (top-half vs bottom-half PE routers)

## Consequences

- NOC provides position-aware routing with deterministic latency
- Contention is captured per directed segment (not per-node)
- All cube-internal traffic is explicitly routed through the NOC
- HBM exclusion zone reflects physical die layout constraints
- The mesh generation is fully parameterized by `topology.yaml`

## Links

- ADR-0003 D3 (cube-level NOC definition, extended by this ADR)
- ADR-0004 D1 (PE DMA to local HBM path via xbar)
- ADR-0004 D3 (cross-half HBM via bridge)
- ADR-0014 D1 (PE_DMA dual egress: xbar for HBM, NOC for non-HBM)
- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
- ADR-0016 D1 (IOChiplet io_noc, analogous pattern at IO chiplet level)
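
As an illustrative aside to D2 and D3, the segment keys and the uncontended latency rule can be sketched as follows. The function names `xy_route` and `uncontended_latency_ns` are hypothetical, the `+1`/`-1` encoding of `direction` is an assumption (the ADR does not specify its values), and snapping to the router grid and HBM-zone exclusion are left out.

```python
def xy_route(src, dst):
    """Return the directed segment keys of an XY route from src to dst.

    src and dst are (x, y) positions assumed to be already snapped to
    the router grid. direction is encoded as +1 / -1 (an assumption).
    """
    (sx, sy), (dx, dy) = src, dst
    segments = []
    if sx != dx:
        # Horizontal segment: travel along the source Y band first.
        direction = 1 if dx > sx else -1
        segments.append(("H", sy, min(sx, dx), max(sx, dx), direction))
    if sy != dy:
        # Vertical segment: travel along the destination X band.
        direction = 1 if dy > sy else -1
        segments.append(("V", dx, min(sy, dy), max(sy, dy), direction))
    return segments


def uncontended_latency_ns(src, dst, ns_per_mm):
    """Manhattan distance (in mm) multiplied by ns_per_mm."""
    distance_mm = abs(dst[0] - src[0]) + abs(dst[1] - src[1])
    return distance_mm * ns_per_mm


forward = xy_route((0, 0), (3, 2))
reverse = xy_route((3, 2), (0, 0))
```

Here `forward` and `reverse` share no keys, because each key carries the band and the direction; in the contention model of D3, each distinct key would map to its own `simpy.Resource(capacity=1)`, so opposite-direction traffic would not serialize against each other.
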