kernbench2/docs/adr/ADR-0017-cube-noc-2d-mesh.md
ywkang 5917b3497c Replace xbar/bridge/single-NOC with explicit router mesh (ADR-0019)
- Remove xbar_top/bot, bridge, single noc node from topology
- Each cube_mesh.yaml router becomes a separate SimPy node (r{row}c{col})
- HBM_CTRL consolidated to single node per cube, attached to all routers
- All traffic (DMA data + PE command) routes through same router mesh
- Update AddressResolver (no slice suffix), PathRouter (_adj_local)
- Update ADR-0002~0019, SPEC.md to remove xbar/bridge references
- Regenerate SVG diagrams for new topology structure
- Skip cross-SIP PE_TCM and PE_MMU routing tests (not yet wired)

326 passed, 13 skipped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 17:51:28 -07:00


ADR-0017: Cube NOC 2D Mesh Architecture

Status

Accepted

Context

ADR-0003 D3 defines the cube-level NOC as a "distributed on-die fabric" but does not specify the internal routing model, contention semantics, or attachment topology. The implementation uses a 2D mesh router grid with XY routing and per-segment contention modeling. This ADR formalizes that architecture.

Decision

D1. NOC node and router grid

Each cube contains a 2D router mesh generated by mesh_gen.py. Each router is a separate topology node (sip{S}.cube{C}.r{row}c{col}) implemented as forwarding_v1. (Supersedes the original single-node noc_2d_mesh_v1 design — see ADR-0019.)

Grid properties:

  • Default dimensions: 6x6 routers (derived from PE layout + UCIe connections)
  • Router naming: r{row}c{col} (e.g., r0c0, r5c5)
  • HBM exclusion zone: center rows/columns are excluded where HBM physically occupies space (e.g., r2c2, r2c3, r3c2, r3c3)
  • Router positions are derived from physical PE corner placement and cube geometry
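The grid properties above can be sketched in a few lines of Python. This is an illustrative sketch only, not the actual mesh_gen.py code: the helper names (router_name, node_id, active_routers) are hypothetical, and the HBM zone is hard-coded to the four example routers listed above rather than derived from cube geometry.

```python
# Illustrative sketch only; not the actual mesh_gen.py implementation.
ROWS, COLS = 6, 6  # default grid dimensions from D1

# Example HBM exclusion zone from D1 (assumed fixed here for illustration).
HBM_ZONE = {(2, 2), (2, 3), (3, 2), (3, 3)}


def router_name(row: int, col: int) -> str:
    """Return the local router name, e.g. r0c0."""
    return f"r{row}c{col}"


def node_id(sip: int, cube: int, row: int, col: int) -> str:
    """Return the full topology node id: sip{S}.cube{C}.r{row}c{col}."""
    return f"sip{sip}.cube{cube}.{router_name(row, col)}"


def active_routers(rows: int = ROWS, cols: int = COLS, hbm_zone=HBM_ZONE):
    """List router names in row-major order, skipping the HBM zone."""
    return [
        router_name(r, c)
        for r in range(rows)
        for c in range(cols)
        if (r, c) not in hbm_zone
    ]
```

With the default 6x6 grid and the four-router example zone, this leaves 32 routers that become topology nodes.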

The NOC overhead_ns is 0.0. Latency is modeled by Manhattan distance traversal within the mesh (distance_mm x ns_per_mm).
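As a minimal sketch of the latency rule, assuming positions are given as (x, y) coordinates in millimetres: the function names below are hypothetical, and ns_per_mm is left as a parameter because this ADR does not state its value.

```python
def manhattan_mm(src, dst):
    """Manhattan distance between two (x_mm, y_mm) positions."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def noc_latency_ns(src, dst, ns_per_mm, overhead_ns=0.0):
    """Uncontended traversal latency: overhead + distance_mm x ns_per_mm.

    overhead_ns defaults to 0.0, matching the NOC overhead in D1.
    """
    return overhead_ns + manhattan_mm(src, dst) * ns_per_mm
```

For example, with a hypothetical ns_per_mm of 0.5, a traversal from (0, 0) to (3, 4) covers 7 mm and takes 3.5 ns.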

D2. XY routing algorithm

The NOC uses deterministic XY routing:

  1. Horizontal segment: route along the source row, from (source X, source Y) to (destination X, source Y)
  2. Vertical segment: route along the destination column, from (destination X, source Y) to (destination X, destination Y)

Each directed segment is identified by a unique link key:

  • Horizontal: ("H", y_band, x_min, x_max, direction)
  • Vertical: ("V", x_band, y_min, y_max, direction)

Grid positions are snapped to the router grid, excluding the HBM zone.
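The two routing steps and the link-key tuples can be sketched as follows. This is a hedged illustration, not the PathRouter implementation: xy_segments is a hypothetical name, the direction labels ("E", "W", "N", "S") are an assumed encoding, Y is assumed to grow southward (row 0 at the top, as in the D4 diagram), and snapping to the router grid around the HBM zone is not modeled.

```python
def xy_segments(src, dst):
    """Return the directed link keys for an XY route.

    src and dst are (x, y) grid positions that are assumed to be
    already snapped to the router grid.
    """
    sx, sy = src
    dx, dy = dst
    segments = []
    if sx != dx:
        # Horizontal first: travel along the source row (y band = sy).
        direction = "E" if dx > sx else "W"
        segments.append(("H", sy, min(sx, dx), max(sx, dx), direction))
    if sy != dy:
        # Then vertical: travel along the destination column (x band = dx).
        direction = "S" if dy > sy else "N"
        segments.append(("V", dx, min(sy, dy), max(sy, dy), direction))
    return segments
```

A route from (0, 0) to (3, 2) yields one horizontal key on row 0 followed by one vertical key on column 3; a route within a single column yields only the vertical key.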

D3. Contention model

Each directed XY segment is a simpy.Resource(capacity=1). Transactions sharing a segment (same row or column band, same direction) contend for the resource. This models link-level serialization in a wormhole-routed mesh.

With no contention, NOC traversal latency equals the Manhattan distance multiplied by ns_per_mm. Under contention, additional queueing delay is added by SimPy's resource scheduling.
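The queueing effect can be illustrated without SimPy. The sketch below is a simplified stand-in for a capacity-1 resource per link key, not the simulator's code: finish_times is a hypothetical helper, each request occupies exactly one segment, and requests are served first-come, first-served by arrival time.

```python
def finish_times(requests):
    """Compute completion times under per-link serialization.

    requests is a list of (arrival_ns, hold_ns, link_key) tuples.
    Each link_key behaves like a capacity-1 resource: a request starts
    when it has arrived AND the link is free. Results are returned in
    the same order as the input list.
    """
    free_at = {}  # link_key -> time at which the link becomes free
    result = [0.0] * len(requests)
    # sorted() is stable, so requests with equal arrival keep input order.
    order = sorted(range(len(requests)), key=lambda i: requests[i][0])
    for i in order:
        arrival, hold, key = requests[i]
        start = max(arrival, free_at.get(key, 0.0))
        free_at[key] = start + hold
        result[i] = start + hold
    return result
```

Two transactions on the same eastbound row segment serialize, so the second one finishes later than it would alone, while a third transaction on a different segment is unaffected.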

D4. NOC attachment points

The NOC connects to all major cube-level components:

                    UCIe-N (conn x4)
                         |
           +---------+---+---+---------+
           |         |       |         |
PE0.dma ---+  r0c0   |  ...  |  r0c5  +--- PE2.dma
PE0.cpu <--+         |       |         +--< PE2.cpu
           |         |       |         |
UCIe-W ----+  ...    | [HBM] |  ...   +---- UCIe-E
(conn x4)  |         | zone  |         |  (conn x4)
           |  r2c0   |       |         |
M_CPU <--->+         |       |         |
           |  r3c0   |       |         |
SRAM <---->+         |       |         |
           |         |       |         |
PE4.dma ---+  r4c0   |  ...  |  r4c5  +--- PE6.dma
PE4.cpu <--+         |       |         +--< PE6.cpu
           |         |       |         |
           +---------+---+---+---------+
                         |
                    UCIe-S (conn x4)

HBM attach: hbm_ctrl is also attached to every router that has a PE (ADR-0019 D1)
(xbar_top/xbar_bot were removed by ADR-0019)

D5. NOC edge bandwidths and distances

| Connection | BW (GB/s) | Distance | Notes |
|---|---|---|---|
| PE_DMA -> NOC | 256.0 | Physical (PE pos) | Matches HBM slice BW |
| NOC -> PE_CPU | - | 0.0 mm | Command path only |
| Router <-> HBM_CTRL | 256.0 | 0.0 mm | Per PE router (ADR-0019) |
| NOC <-> M_CPU | - | 0.0 mm | Command path |
| NOC <-> SRAM | 128.0 x4 | 0.0 mm | 512 GB/s aggregate |
| NOC <-> UCIe conn | 128.0 | 0.0 mm | Per connection, 4 per port |

Distance 0.0 mm for most connections reflects the distributed nature of the NOC; the actual traversal distance is computed internally via Manhattan distance within the router grid.

D6. UCIe decomposition and inter-cube traffic

Each cube has 4 UCIe ports (N, S, E, W). Each port is decomposed into:

  • 1 ucie-{PORT} node: UCIe protocol endpoint (overhead = 8.0 ns)
  • 4 ucie-{PORT}.conn{0-3} nodes: connection bridges between NOC and UCIe

This decomposition enables N=4 independent NOC-to-UCIe connections per port, each with 128 GB/s bandwidth. Total aggregate per port: 512 GB/s.

Inter-cube traffic path:

Source: PE_DMA -> NOC -> conn{i} -> ucie-{PORT}
                    [UCIe link: 512 GB/s, 1.0mm seam distance]
Target: ucie-{PORT} -> conn{i} -> r{x}c{y} -> (mesh hops) -> hbm_ctrl

UCIe overhead (8.0 ns) is applied at each ucie-{PORT} node, so a full crossing incurs 16 ns (TX port + RX port).
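The per-port numbers in this section follow from simple arithmetic, sketched below. The constants come from D5 and D6; the function names are hypothetical and only illustrate how the node names and totals are derived.

```python
UCIE_OVERHEAD_NS = 8.0   # per ucie-{PORT} node (D6)
CONNS_PER_PORT = 4       # conn0..conn3 (D6)
CONN_BW_GBPS = 128.0     # per NOC-to-UCIe connection (D5)


def conn_nodes(port: str, n: int = CONNS_PER_PORT):
    """Connection-bridge node names for one UCIe port (N, S, E or W)."""
    return [f"ucie-{port}.conn{i}" for i in range(n)]


def port_aggregate_bw_gbps(n: int = CONNS_PER_PORT, bw: float = CONN_BW_GBPS):
    """Aggregate NOC-to-UCIe bandwidth of one port."""
    return n * bw


def crossing_overhead_ns(ports_traversed: int = 2):
    """UCIe overhead for one seam crossing (TX port + RX port)."""
    return ports_traversed * UCIE_OVERHEAD_NS
```

With the defaults, port_aggregate_bw_gbps() gives 512.0 GB/s and crossing_overhead_ns() gives 16.0 ns, matching the figures stated above.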

D7. Data paths through the NOC

PE DMA to local HBM (same half):

PE_DMA -> r{x}c{y} -> hbm_ctrl  (local: 0 mesh hops, switching overhead only)

PE DMA to remote PE's HBM:

PE_DMA -> r{x}c{y} -> (mesh hops) -> r{x'}c{y'} -> hbm_ctrl

PE DMA to remote cube HBM:

PE_DMA -> r{x}c{y} -> conn -> ucie-E -> [seam] -> ucie-W -> conn -> r{x'}c{y'} -> hbm_ctrl

Kernel Launch command to PE:

[from io_noc] -> ucie -> conn -> r{x}c{y} -> (mesh hops) -> M_CPU -> (mesh hops) -> PE_CPU

Shared SRAM access:

PE_DMA -> r{x}c{y} -> (mesh hops) -> SRAM

D8. Mesh generation

The router grid is generated by mesh_gen.py based on:

  • cube.pe_layout: corner placement (NW, NE, SW, SE) and PEs per corner
  • cube.geometry: cube physical dimensions and HBM zone
  • cube.ucie.n_connections: determines router count for UCIe attachment

The generator produces a mesh_data dictionary containing:

  • Router grid with positions and HBM exclusion zones
  • PE-to-router attachments (pe_dma, pe_cpu per PE)
  • UCIe-to-router attachments (N/S/E/W, distributed across edge routers)
  • M_CPU and SRAM router attachments
  • HBM attachment per PE router (ADR-0019)
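To make the shape of the generator output concrete, the sketch below builds a small dictionary with the five kinds of content listed above and checks it for consistency. Every key name and placement here is an assumption for illustration: the real mesh_data schema is not specified in this ADR, and the PE, M_CPU, SRAM and UCIe-N placements only loosely follow the D4 diagram.

```python
# Hypothetical mesh_data layout; key names are NOT the real schema.
HBM_ZONE = {"r2c2", "r2c3", "r3c2", "r3c3"}

mesh_data = {
    "routers": [
        f"r{r}c{c}"
        for r in range(6)
        for c in range(6)
        if f"r{r}c{c}" not in HBM_ZONE
    ],
    "hbm_zone": sorted(HBM_ZONE),
    # pe_dma and pe_cpu attachment per PE (placements assumed).
    "pe_attach": {
        "PE0": {"pe_dma": "r0c0", "pe_cpu": "r0c0"},
        "PE2": {"pe_dma": "r0c5", "pe_cpu": "r0c5"},
        "PE4": {"pe_dma": "r4c0", "pe_cpu": "r4c0"},
        "PE6": {"pe_dma": "r4c5", "pe_cpu": "r4c5"},
    },
    # One edge router per UCIe connection (only port N shown).
    "ucie_attach": {"N": ["r0c1", "r0c2", "r0c3", "r0c4"]},
    "m_cpu": "r2c0",
    "sram": "r3c0",
}
# HBM_CTRL attaches to every router that hosts a PE (ADR-0019).
mesh_data["hbm_attach"] = sorted(
    {a["pe_dma"] for a in mesh_data["pe_attach"].values()}
)


def invalid_attachments(data):
    """Return attached router names that are not active routers."""
    used = {data["m_cpu"], data["sram"], *data["hbm_attach"]}
    for attach in data["pe_attach"].values():
        used.update(attach.values())
    for routers in data["ucie_attach"].values():
        used.update(routers)
    return sorted(used - set(data["routers"]))
```

A check like invalid_attachments(mesh_data) returning an empty list confirms that no component is attached to a router inside the HBM exclusion zone or outside the grid.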

Consequences

  • The NOC provides position-aware, deterministic XY routing; traversal latency is deterministic when there is no contention
  • Contention is captured per directed segment (not per-node)
  • All cube-internal traffic is explicitly routed through the NOC
  • HBM exclusion zone reflects physical die layout constraints
  • The mesh generation is fully parameterized by topology.yaml

References

  • ADR-0003 D3 (cube-level NOC definition, extended by this ADR)
  • ADR-0004 D1 (PE DMA to local HBM path via router mesh)
  • ADR-0014 D1 (PE_DMA egress via router mesh)
  • ADR-0019 (NOC-Local HBM: xbar/bridge removed, explicit router mesh)
  • ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
  • ADR-0016 D1 (IOChiplet io_noc — analogous pattern at IO chiplet level)