# ADR-0017: Cube NOC and HBM Connectivity

## Status

Accepted

## Context

The CUBE-level NOC is a 2D router mesh that carries every intra-cube request: PE-to-HBM data, PE-to-PE traffic, command paths (M_CPU↔PE_CPU), shared SRAM access, and inter-cube UCIe traffic.

The CUBE's HBM is exposed through per-PE controller endpoints attached to PE routers. This per-PE partitioning makes local-vs-remote HBM distinguishable by mesh distance: a PE's own HBM partition sits at its own router (switching overhead only); another PE's HBM partition is reachable by mesh hops to that PE's router.

Two channel-mapping modes are supported in the design space:

- **n:1 (default, implemented)**: each PE's HBM partition aggregates `channels_per_pe` pseudo-channels into one endpoint. Effective per-PE BW = N × per-channel BW.
- **1:1 (future)**: each PE router decomposes into per-channel mini-routers; per-channel BW contention is modeled directly.

In both modes the per-PE effective BW is identical; only the connectivity granularity differs.

## Decision

### D1. 2D router mesh

Each cube contains a 2D mesh of NOC routers generated by `mesh_gen.py`.

- Node naming: `sip{S}.cube{C}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`).
- Implementation: `forwarding_v1`. NOC `overhead_ns = 0`.
- Default 6×6 grid (sized from PE corner placement + UCIe attachment count); larger PE counts scale the grid up.
- HBM exclusion zone: center rows/columns are excluded where HBM die physically occupies space (e.g., r2c2, r2c3, r3c2, r3c3 for a 6×6).
- Latency = Manhattan distance × `ns_per_mm`.

### D2. XY routing algorithm

Deterministic XY routing:

1. Horizontal segment: route from source X to destination X at source Y.
2. Vertical segment: route from destination X at source Y to destination Y.

Each directed segment carries a unique key:

- Horizontal: `("H", y_band, x_min, x_max, direction)`
- Vertical: `("V", x_band, y_min, y_max, direction)`

Grid positions are snapped to the router grid, excluding the HBM zone.

### D3. Per-segment contention model

Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions sharing a segment (same row or column band, same direction) contend for the resource, modelling link-level serialization in a wormhole-routed mesh.

With no contention, NOC traversal latency equals Manhattan distance × `ns_per_mm`. Under contention, SimPy's resource scheduling adds queueing delay.

### D4. NOC attachment points (per-PE HBM partition)

Every PE router carries three attachments: `pe{idx}.dma`, `pe{idx}.cpu`, and `pe{idx}.hbm`. The last is the per-PE HBM controller endpoint, `sip{S}.cube{C}.hbm_ctrl.pe{idx}`, which owns one slice of the cube's HBM (one pseudo-channel group; see D8).

Other attachments:

- M_CPU and shared SRAM each occupy a dedicated edge router.
- UCIe endpoints (N/S/E/W) each expose 4 connection routers distributed along that edge (see D6).

```text
                  UCIe-N (conn x4)
                         |
           +---------+---+---+---------+
           |         |       |         |
PE0.dma ---+  r0c0   |  ...  |  r0c5   +--- PE2.dma
PE0.cpu <--+ +hbm.pe0|       | +hbm.pe2+--< PE2.cpu
           |         |       |         |
UCIe-W ----+   ...   | [HBM] |   ...   +---- UCIe-E
(conn x4)  |         | zone  |         |     (conn x4)
           |  r2c0   |       |         |
M_CPU <--->+         |       |         |
           |  r3c0   |       |         |
SRAM <---->+         |       |         |
           |         |       |         |
PE4.dma ---+  r4c0   |  ...  |  r4c5   +--- PE6.dma
PE4.cpu <--+ +hbm.pe4|       | +hbm.pe6+--< PE6.cpu
           |         |       |         |
           +---------+---+---+---------+
                         |
                  UCIe-S (conn x4)
```

Per-PE HBM partitioning is the key invariant that makes local vs cross-PE HBM distinguishable by mesh distance (see D7).

### D5. NOC edge bandwidths and distances

| Connection                | BW (GB/s) | Distance      | Notes                                      |
| ------------------------- | --------- | ------------- | ------------------------------------------ |
| PE_DMA → NOC              | 256.0     | Physical (PE) | Matches local-HBM aggregate BW             |
| NOC → PE_CPU              | —         | 0.0 mm        | Command path only                          |
| Router ↔ hbm_ctrl.pe{idx} | 256.0     | 0.0 mm        | Per PE router; N × per-channel BW (see D8) |
| NOC ↔ M_CPU               | —         | 0.0 mm        | Command path                               |
| NOC ↔ SRAM                | 128.0 × 4 | 0.0 mm        | 512 GB/s aggregate                         |
| NOC ↔ UCIe conn           | 128.0     | 0.0 mm        | Per connection; 4 conn per port            |

`0.0 mm` distances reflect the distributed nature of the NOC; actual traversal distance is computed via Manhattan distance within the router grid.

### D6. UCIe decomposition and inter-cube traffic

Each of the 4 UCIe ports (N, S, E, W) decomposes into:

- 1 `ucie-{PORT}` node: UCIe protocol endpoint (`overhead = 8.0 ns`).
- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe.

This decomposition gives 4 independent NOC↔UCIe connections per port, each with 128 GB/s bandwidth (512 GB/s aggregate per port).

Inter-cube traffic path:

```text
Source:  PE_DMA → NOC → conn{i} → ucie-{PORT}
         [UCIe link: 512 GB/s, 1.0mm seam distance]
Target:  ucie-{PORT} → conn{i} → r{x}c{y} → (mesh hops) → hbm_ctrl.pe{idx}
```

UCIe overhead (8.0 ns) is applied at each `ucie-{PORT}` node, so a full crossing incurs 16 ns (TX port + RX port).

### D7. Data paths through the NOC

All intra-cube traffic uses the same router mesh; there are no separate fast paths.
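
As a rough illustration of how a path over that shared mesh can be enumerated, the sketch below walks a deterministic XY route (horizontal segment first, then vertical, as in D2) between routers named `r{row}c{col}`. The helper `xy_route` is hypothetical and is not taken from `mesh_gen.py` or the router implementation. It also assumes every grid position on the path holds a router, so the HBM exclusion zone and per-segment contention are not modelled.

```python
def xy_route(src: tuple[int, int], dst: tuple[int, int]) -> list[str]:
    """Return the router names visited on an XY route from src to dst.

    Coordinates are (row, col). Assumption: the HBM exclusion zone is
    ignored, so every position on the path is treated as a valid router.
    """
    row, col = src
    dst_row, dst_col = dst
    path = [f"r{row}c{col}"]

    # Horizontal segment: stay on the source row, move to the destination column.
    step = 1 if dst_col >= col else -1
    while col != dst_col:
        col += step
        path.append(f"r{row}c{col}")

    # Vertical segment: stay on the destination column, move to the destination row.
    step = 1 if dst_row >= row else -1
    while row != dst_row:
        row += step
        path.append(f"r{row}c{col}")

    return path


route = xy_route((0, 0), (1, 4))
print(" -> ".join(route))
print("mesh hops:", len(route) - 1)
```

For `r0c0` to `r1c4` this gives five mesh hops. A source equal to the destination gives a single-router path with zero hops, which corresponds to the local-HBM case.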

**Local HBM** (same PE's own partition; 0 mesh hops):

```text
PE_DMA → r{x}c{y} → hbm_ctrl.pe{idx}     (switching overhead only)
```

**Cross-PE HBM within cube** (target PE's partition, reached by mesh):

```text
PE_DMA → r{x}c{y} → (mesh hops) → r{x'}c{y'} → hbm_ctrl.pe{idx'}
```

Example: PE0 (on `r0c0`) accessing PE2's HBM (PE2 on `r1c4`):

```text
PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl.pe2
```

Dijkstra computes the shortest path within the mesh.

**Cross-cube HBM** (UCIe traversal):

```text
PE_DMA → r{x}c{y} → conn → ucie-{PORT} → [seam] → ucie-{PORT'} → conn → r{x'}c{y'} → hbm_ctrl.pe{idx'}
```

**Kernel launch command to PE**:

```text
[from io_noc] → ucie → conn → r{x}c{y} → (mesh) → M_CPU → (mesh) → PE_CPU
```

**Shared SRAM access**:

```text
PE_DMA → r{x}c{y} → (mesh) → SRAM
```

### D8. HBM channel mapping mode

Channel mapping is configured at cube scope:

```yaml
cube:
  memory_map:
    hbm_mapping_mode: n_to_one     # one_to_one | n_to_one
    hbm_pseudo_channels: 64        # total pseudo-channel count
    hbm_channels_per_pe: 8         # per-PE local channel count
    hbm_channel_bw_gbs: 32.0       # per-channel bandwidth (GB/s)
    hbm_slices_per_cube: 8         # number of per-PE partitions
    hbm_total_gb_per_cube: 48
```

**n:1 mode (default, implemented).** Each PE's HBM partition is a single endpoint `hbm_ctrl.pe{idx}` that aggregates `channels_per_pe` pseudo-channels. The `Router ↔ hbm_ctrl.pe{idx}` link bandwidth equals `channels_per_pe × hbm_channel_bw_gbs`. Pseudo-channels are assumed to interleave; only aggregate per-PE BW is modeled. No separate aggregated router node exists; the per-PE router itself serves that role.

**1:1 mode (future).** Each PE router decomposes into N channel mini-routers; per-channel routing carries fully-resolved PA + channel ID. A `ChannelSplitter` resolves a logical access to N per-channel physical requests. Per-channel link models BW contention. Cross-PE channel access semantics are deferred to the implementation ADR.
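
The n:1 link bandwidth follows directly from the cube-scope parameters, so the arithmetic can be checked in a few lines. The following is a minimal sketch only: the `HbmMapping` dataclass, its method names, and the consistency check are illustrative assumptions and do not correspond to any class in the codebase. The field names mirror the `cube.memory_map` keys shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HbmMapping:
    """Hypothetical container mirroring the cube.memory_map keys."""

    hbm_pseudo_channels: int = 64
    hbm_channels_per_pe: int = 8
    hbm_channel_bw_gbs: float = 32.0
    hbm_slices_per_cube: int = 8

    def per_pe_bw_gbs(self) -> float:
        # n:1 mode: one endpoint aggregates channels_per_pe pseudo-channels.
        return self.hbm_channels_per_pe * self.hbm_channel_bw_gbs

    def cube_bw_gbs(self) -> float:
        # Total HBM bandwidth across every pseudo-channel in the cube.
        return self.hbm_pseudo_channels * self.hbm_channel_bw_gbs

    def is_consistent(self) -> bool:
        # Assumed sanity check: the per-PE slices account for every pseudo-channel.
        return (
            self.hbm_channels_per_pe * self.hbm_slices_per_cube
            == self.hbm_pseudo_channels
        )


mapping = HbmMapping()
print(mapping.per_pe_bw_gbs(), mapping.cube_bw_gbs(), mapping.is_consistent())
```

With the default values this reproduces the 256 GB/s `Router ↔ hbm_ctrl.pe{idx}` link bandwidth listed in D5.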

**BW math (defaults).**

| Parameter                | Value               |
| ------------------------ | ------------------- |
| pseudo channels per cube | 64 (parameter)      |
| PEs per cube             | 8 (parameter)       |
| channels per PE (N)      | 64 / 8 = 8          |
| per-channel BW           | 32 GB/s (parameter) |
| per-PE local BW          | N × 32 = 256 GB/s   |
| cube total HBM BW        | 64 × 32 = 2048 GB/s |

Both modes give the same per-PE effective BW; only the request shape and contention model differ.

### D9. AddressResolver: per-PE HBM endpoint

The address resolver decodes a PA's HBM offset to the owning PE's partition:

```python
# policy/routing/router.py
hbm_slice_bytes = hbm_total_gb_per_cube * (1 << 30) // hbm_slices_per_cube

if addr.kind == "hbm":
    pe_id = int(addr.hbm_offset) // hbm_slice_bytes
    return f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
```

The `pe_id` computation is intrinsic to the routing layer (not a topology-time concern). Any HBM PA falls within exactly one partition, yielding deterministic routing.

External callers (e.g., M_CPU DMA, Memory R/W from PCIE_EP) follow the same resolver path; there is no separate fast path.

### D10. Mesh generation parameters

`mesh_gen.py` produces `cube_mesh.yaml` from:

- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner.
- `cube.geometry`: cube physical dimensions and HBM zone.
- `cube.ucie.n_connections`: determines router count for UCIe attachment.

Output `mesh_data` dictionary contains:

- Router grid with positions and HBM exclusion zones.
- PE-to-router attachments (`pe{idx}.dma`, `pe{idx}.cpu`, `pe{idx}.hbm` per PE).
- UCIe-to-router attachments (N/S/E/W distributed across edge routers).
- M_CPU and SRAM router attachments.

## Consequences

- Local HBM (0 mesh hops, switching overhead only) and cross-PE HBM (mesh hops) are naturally distinguishable, satisfying SPEC R5 (multi-domain communication) and ADR-0002 (no zero-latency end-to-end paths).
- All cube-internal traffic routes through one mesh: single contention model, single layout, single set of edge BWs.
- Per-PE HBM partitioning maps cleanly to the LA model (ADR-0011): each PE's partition is the n:1 aggregate of its assigned pseudo-channels.
- 1:1 mode extension is structurally natural: split each PE router into N channel routers.
- Mesh generation is fully parameterised by `topology.yaml`; PE/cube geometry changes propagate without code edits.

## Links

- ADR-0002 (Routing distance, ordering, no zero-latency paths)
- ADR-0003 D3 (cube-level NOC definition, extended here)
- ADR-0004 (Memory semantics, local HBM)
- ADR-0011 (Memory addressing; LA model consumes per-PE partition)
- ADR-0014 D1 (PE_DMA egress via router mesh)
- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
- ADR-0016 (IOChiplet io_noc; analogous pattern at IO chiplet level)
- ADR-0033 (Latency model: per-PC parallelism, switch penalty)