# ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC

## Status

Accepted

## Context

The CUBE-internal NOC must connect each PE to HBM. KernBench needs to evaluate two connectivity models:

- **1:1 mode**: PE_DMA connects to N separate per-channel routers, each with its own link to hbm_ctrl. Models per-channel BW contention precisely. N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`).
- **n:1 mode**: PE_DMA connects to a single aggregated router with one link to hbm_ctrl. Channels are treated as interleaved; only aggregate BW is modeled.

Effective PE-local BW is identical under both modes (= N × per-channel BW); only the connectivity granularity differs.

---

## Decision

### D1. HBM Attaches to PE Routers

Consolidate the current `hbm_ctrl.slice{0-7}` (8 nodes) into a **single `hbm_ctrl` node**, and attach the HBM access point to the same router where the PE is attached.

- n:1 mode: a PE's local HBM access goes directly through its own router (switching overhead only, 0 hops)
- A remote PE's HBM access reaches the target PE's router via mesh hops
- The read/write resource model within the HBM controller is preserved

Node naming changes:

| Current | After Change |
| ---- | ------- |
| `sip0.cube0.hbm_ctrl.slice0` ~ `slice7` | `sip0.cube0.hbm_ctrl` (single) |

In `mesh_gen.py`, add `pe{idx}.hbm` to the PE attachment so that the builder generates an edge between that router and hbm_ctrl.

---

### D2. Complete Removal of xbar, bridge, and Single NOC Node

Remove all of the following nodes and related edges:

- `{cube}.xbar_top`, `{cube}.xbar_bot`
- `{cube}.bridge.left`, `{cube}.bridge.right`
- `{cube}.noc` (single TwoDMeshNocComponent node)
- Edges of type `noc_to_xbar`, `xbar_to_noc`, `xbar_to_hbm`, `hbm_to_xbar`
- Edges of type `xbar_to_bridge`, `bridge_to_xbar`
- Edges of type `pe_to_noc`, `noc_to_pe`, `noc_to_pe_cpu`, etc. referencing the single noc node

Their role is replaced by an **explicit router mesh based on cube_mesh.yaml**.
Each router (r0c0, r0c1, ...) from the 6x6 router grid generated by `mesh_gen.py` is created as a separate SimPy node in the topology graph, and adjacent routers are connected via XY mesh edges.

---

### D3. Explicit Router Mesh (Common Basis for n:1 / 1:1)

#### Router Nodes Based on cube_mesh.yaml

Each non-null router from cube_mesh.yaml generated by `mesh_gen.py` is created as a **separate SimPy node** in the topology graph.

- Node ID: `{cube}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`)
- kind: `noc_router`, impl: `forwarding_v1`
- pos_mm: taken from cube_mesh.yaml

Based on the attach information in cube_mesh.yaml, components are connected to each router:

- `pe{p}.dma` → PE_DMA ↔ router edge
- `pe{p}.cpu` → PE_CPU ↔ router edge
- `pe{p}.hbm` → HBM_CTRL ↔ router edge (added in n:1)
- `m_cpu` → M_CPU ↔ router edge
- `sram` → SRAM ↔ router edge
- `ucie_{dir}.c{i}` → UCIe conn ↔ router edge

Router-to-router XY mesh edges: bidirectional edges between adjacent routers. Null routers (HBM exclusion zones) are skipped.

#### 1:1 Mode Extension (To Be Implemented Later)

In 1:1 mode, each router differentiates into N channel mini-routers. Per-channel routing and ChannelSplitter (LA → per-channel PA) introduction are required. N GEMM engines per PE are also added at this point.

---

### D4. Cross-PE HBM Access (n:1 Mode)

In n:1 mode, when a PE accesses another PE's local HBM, it hops through the XY mesh in cube_mesh.yaml to reach the target PE's router.

Example: PE0 (r0c0) accessing PE2's (r1c4) HBM:

```text
PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl
```

The Dijkstra router finds the shortest path in the mesh.

Cross-PE channel access in 1:1 mode will be defined during the 1:1 extension in D3.

---

### D5. n:1 Mode: Uses cube_mesh.yaml Router Mesh

In n:1 mode, no separate "aggregated router" is created. The existing router grid from cube_mesh.yaml serves that role.

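The router grid described in D3 and the shortest-path lookup mentioned in D4 can be sketched roughly as follows. This is only an illustration under stated assumptions: `build_router_mesh` and `shortest_path` are hypothetical helper names (not taken from `mesh_gen.py`), every mesh edge is given a unit cost, and the grid is treated as fully populated unless `null_cells` is passed.

```python
import heapq


def build_router_mesh(rows=6, cols=6, null_cells=frozenset(), cube="sip0.cube0"):
    """Return {router_id: [neighbor_ids]} for all non-null routers (hypothetical helper)."""
    def rid(r, c):
        return f"{cube}.r{r}c{c}"

    live = {(r, c) for r in range(rows) for c in range(cols)} - set(null_cells)
    adj = {rid(r, c): [] for (r, c) in sorted(live)}
    for (r, c) in sorted(live):
        # Link each router to its right and lower neighbor; adding both
        # directions here makes every mesh edge bidirectional.
        for nr, nc in ((r, c + 1), (r + 1, c)):
            if (nr, nc) in live:
                adj[rid(r, c)].append(rid(nr, nc))
                adj[rid(nr, nc)].append(rid(r, c))
    return adj


def shortest_path(adj, src, dst):
    """Dijkstra over the mesh, assuming a unit cost per edge."""
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for v in adj[u]:
            nd = d + 1
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None  # unreachable (e.g. cut off by null routers)
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


mesh = build_router_mesh()
path = shortest_path(mesh, "sip0.cube0.r0c0", "sip0.cube0.r1c4")
print(len(path) - 1, "hops:", " → ".join(path))
```

On a full 6x6 grid the r0c0 to r1c4 route is 5 mesh hops; several equal-length routes exist, so the exact router sequence depends on tie-breaking and need not match the example in D4. Passing `null_cells` drops those routers and their edges, which is how HBM exclusion zones would be skipped in this sketch.
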
#### Connection Structure

PE_DMA, PE_CPU, and HBM are all connected to the router where each PE is attached:

```text
sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0   (bw: N × channel_bw_gbs)
sip0.cube0.hbm_ctrl   ←→ sip0.cube0.r0c0   (bw: N × channel_bw_gbs)
```

Routers are connected via XY mesh edges. A PE's local HBM access goes directly through its own router (switching overhead only).

#### n:1 Mode Full Data Paths

**Local HBM (0 hops):**

```text
PE0.pe_dma → r0c0 → hbm_ctrl   (switching overhead only)
```

**Remote HBM (mesh hops):**

```text
PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl
```

**M_CPU DMA:**

```text
M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl
```

---

### D6. All Traffic Is Unified onto the Same Router Mesh

- All memory accesses (DMA data) and commands (PE_CPU) use the same router mesh
- Local access does not use a separate fast path (xbar)
- Cross-cube (remote) access path:

```text
PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT}
  → [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl
```

UCIe connections maintain the existing structure, but both endpoints become mesh routers instead of xbars.
The number of UCIe lines is determined by the BW ratio: `ucie_lines_per_side = ceil(ucie_bw / noc_line_bw)`.

---

### D7. AddressResolver Changes

Current `AddressResolver.resolve()`:

```python
# Current: HBM offset → pe_slice → "sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
pe_slice = PhysAddr.hbm_pe_id(addr.hbm_offset, self._slice_size_bytes)
return f"sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
```

After change:

```python
# Changed: HBM → single endpoint
return f"sip{s}.cube{c}.hbm_ctrl"
```

The pe_slice calculation is removed. In n:1 mode, PE_DMA directly accesses the hbm_ctrl attached to its own router.
`resolver.resolve()` is retained for external access (M_CPU DMA, etc.) and backward compatibility.

---

### D8. topology.yaml Configuration Changes

#### Added Settings

```yaml
cube:
  memory_map:
    hbm_mapping_mode: n_to_one       # one_to_one | n_to_one
    hbm_pseudo_channels: 64          # total pseudo channel count
    hbm_channels_per_pe: 8           # local channels per PE (= pseudo_channels / pes_per_cube)
    hbm_channel_bw_gbs: 32.0         # per-channel bandwidth (GB/s)
    hbm_total_gb_per_cube: 48        # retained
```

#### Removed Settings

```yaml
# To be removed
links:
  xbar_to_hbm_bw_gbs: 256.0      # → replaced by hbm_channel_bw_gbs × hbm_channels_per_pe
  xbar_to_hbm_mm: 2.5            # → replaced by ch_router_to_hbm_mm
  xbar_to_bridge_bw_gbs: 128.0   # → removed (no bridge)
  xbar_to_bridge_mm: 3.0         # → removed
  noc_to_xbar_bw_gbs: ...        # → removed
  noc_to_xbar_mm: ...            # → removed
```

#### Added Link Settings

```yaml
links:
  router_link_bw_gbs: 256.0      # XY mesh link BW between routers
  router_overhead_ns: 2.0        # router switching overhead
  pe_to_router_bw_gbs: 256.0     # PE_DMA ↔ router
  hbm_to_router_bw_gbs: 256.0    # HBM ↔ router (= N × channel_bw)
```

---

### D9. Bandwidth Numerical Consistency

| Configuration | Value |
| ---- | --- |
| pseudo channels per cube | 64 (parameter) |
| PEs per cube | 8 (parameter) |
| channels per PE (N) | `pseudo_channels / pes_per_cube` = 8 |
| per-channel BW | 32 GB/s (parameter) |
| per-PE local BW | N × 32 = 256 GB/s |
| cube total HBM BW | 64 × 32 = 2048 GB/s |

The effective BW per PE is identical in both modes:

- 1:1 mode: N channel links × channel_bw_gbs = N × 32 = 256 GB/s
- n:1 mode: 1 aggregated link = N × channel_bw_gbs = 256 GB/s

---

## Consequences

### Positive

- The router mesh based on cube_mesh.yaml accurately reflects physical placement
- In n:1 mode, the existing VA scheme is preserved, keeping transition costs low
- Local / remote / command traffic is unified onto the same mesh, which simplifies the model
- Aligns well with graph compiler-based topology generation
- Channel count and PE count are both parameterized, enabling testing of various configurations
- The 1:1 mode extension follows naturally through router differentiation

### Negative

- The number of SimPy nodes increases due to explicit router nodes (up to 32 non-null routers per cube on the 6x6 grid)
- The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model

---

## Alternatives

### A1. Retain Existing xbar + HBM Slices

- Local/remote paths remain bifurcated
- Cannot model at pseudo-channel granularity
- Cannot switch between 1:1/n:1 modes

### A2. Always Generate Per-Channel Links and Aggregate Only in n:1

- Topology structure always has 1:1 size
- Expressing n:1 semantics via link aggregation is complex
- No reduction in router node count

### A3. Gradual Transition (Retain xbar + Add NOC Path)

- Higher compatibility, but dual-path coexistence increases complexity
- Since xbar removal is ultimately necessary, the intermediate step provides little value

---

## Test Requirements

- Verify that requests are delivered via per-channel links in 1:1 mode
- Verify that requests are delivered via the aggregated link in n:1 mode
- Verify that topology is correctly generated in both modes:
  - 1:1: `total_ch` channel routers + per-PE links + horizontal links
  - n:1: `pes_per_cube` aggregated routers + per-PE links
- Verify that effective BW is consistent across both modes for the same workload
- Verify that horizontal line routing works for cross-PE access
- Verify that routing through UCIe works for cross-cube access
- Verify that topology generation is correct under parameter variations (channels_per_pe = 4, 8, 16, etc.)

---

## Links

- ADR-0011 (LA model) → addressing-side integration
- ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
- ADR-0004 (Memory Semantics) → BW model redefinition
- ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes
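
For reference, the D9 arithmetic and the parameter-variation item under Test Requirements could be checked with a small helper along these lines. This is a minimal sketch, not part of the simulator: `hbm_bw_summary` is a hypothetical name, its parameters mirror the `topology.yaml` keys from D8, and `pes_per_cube` is assumed to be available as a plain integer.

```python
def hbm_bw_summary(hbm_pseudo_channels=64, pes_per_cube=8, hbm_channel_bw_gbs=32.0):
    """Derive channels per PE and per-PE / per-cube HBM bandwidth (hypothetical helper)."""
    if hbm_pseudo_channels % pes_per_cube:
        raise ValueError("hbm_pseudo_channels must be divisible by pes_per_cube")
    n = hbm_pseudo_channels // pes_per_cube  # channels per PE (N)
    # 1:1 mode: N separate channel links, each carrying hbm_channel_bw_gbs.
    one_to_one_bw = sum([hbm_channel_bw_gbs] * n)
    # n:1 mode: one aggregated link sized to N × hbm_channel_bw_gbs.
    n_to_one_bw = n * hbm_channel_bw_gbs
    return {
        "channels_per_pe": n,
        "one_to_one_pe_bw_gbs": one_to_one_bw,
        "n_to_one_pe_bw_gbs": n_to_one_bw,
        "cube_total_bw_gbs": hbm_pseudo_channels * hbm_channel_bw_gbs,
    }


# Sweep channels_per_pe = 4, 8, 16 with 8 PEs per cube, as in Test Requirements.
for channels_per_pe in (4, 8, 16):
    summary = hbm_bw_summary(hbm_pseudo_channels=channels_per_pe * 8)
    assert summary["one_to_one_pe_bw_gbs"] == summary["n_to_one_pe_bw_gbs"]
    print(channels_per_pe, summary)
```

With the default parameters the helper reproduces the D9 table: 8 channels per PE, 256 GB/s per PE in either mode, and 2048 GB/s per cube.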