kernbench2/docs/adr/ADR-0019-NOC-Local HBM.en.md
ywkang b2c52f0e34 Add English translations for ADR-0018, 0019, 0020, 0021
- ADR-0018: LA-based memory address abstraction + BAAW + HBM channel mapping
- ADR-0019: CUBE NOC per-channel and aggregated HBM connection model
- ADR-0020: 2-pass data execution model (timing/data separation, greenlet)
- ADR-0021: PE pipeline refactor (component separation + token self-routing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 16:31:32 -07:00

ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC

Status

Proposed

Context

ADR-0018 introduced LA-based address abstraction and BAAW, defining how a logical memory access is translated into one of the following two request forms:

  • 1:1 mode: one logical access → N per-channel requests
  • n:1 mode: one logical access → one aggregated request

Here N = hbm_pseudo_channels / pes_per_cube (= channels_per_pe), determined by topology parameters.
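
A minimal sketch of the two request forms, assuming the access is split evenly across channels. The `Request` type, the `expand_access` helper, and the even-striping rule are hypothetical illustrations, not the BAAW logic defined in ADR-0018:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Request:
    channel: Optional[int]  # None marks an aggregated (n:1) request
    offset: int
    size: int


def expand_access(offset: int, size: int, n_channels: int, mode: str) -> List[Request]:
    """Expand one logical access into per-channel or aggregated requests."""
    if mode == "n_to_one":
        # One logical access -> one aggregated request.
        return [Request(channel=None, offset=offset, size=size)]
    if mode == "one_to_one":
        # One logical access -> N per-channel requests (even split is assumed).
        if size % n_channels:
            raise ValueError("size must be a multiple of n_channels in this sketch")
        chunk = size // n_channels
        return [
            Request(channel=ch, offset=offset + ch * chunk, size=chunk)
            for ch in range(n_channels)
        ]
    raise ValueError(f"unknown mode: {mode}")
```

With N = 8, a 4096-byte access becomes eight 512-byte requests in 1:1 mode and a single 4096-byte request in n:1 mode.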

Problems with the Existing Structure

In the current implementation (topology/builder.py):

  • PE_DMA → NOC → xbar_top/xbar_bot → HBM_CTRL.slice{0-7} path is used
  • HBM is modeled as 8 slice (= per-PE) nodes
  • Local/remote access use different paths:
    • local: NOC → xbar → HBM slice
    • cross-half: NOC → xbar_top → bridge → xbar_bot → HBM slice
    • remote cube: NOC → UCIe → remote NOC → remote xbar → remote HBM slice

Limitations of this structure:

  • Cannot model at the pseudo-channel granularity (slice = per-PE granularity, not per-channel)
  • The xbar and bridge split local and remote accesses onto different paths
  • Cannot express 1:1 / n:1 modes consistently

Decision

D1. HBM Attaches to PE Routers

Consolidate the current hbm_ctrl.slice{0-7} (8 nodes) into a single hbm_ctrl node, and attach the HBM access point to the same router where the PE is attached.

  • n:1 mode: a PE's local HBM access goes directly through its own router (switching overhead only, 0 hops)
  • Access to a remote PE's HBM reaches the target PE's router via mesh hops
  • The read/write resource model within the HBM controller is preserved

Node naming changes:

| Current | After Change |
|---|---|
| sip0.cube0.hbm_ctrl.slice0 ~ slice7 | sip0.cube0.hbm_ctrl (single) |

In mesh_gen.py, add pe{idx}.hbm to the PE attachment so that the builder generates an edge between that router and hbm_ctrl.


D2. Complete Removal of xbar, bridge, and Single NOC Node

Remove all of the following nodes and related edges:

  • {cube}.xbar_top, {cube}.xbar_bot
  • {cube}.bridge.left, {cube}.bridge.right
  • {cube}.noc (single TwoDMeshNocComponent node)
  • Edges of type noc_to_xbar, xbar_to_noc, xbar_to_hbm, hbm_to_xbar
  • Edges of type xbar_to_bridge, bridge_to_xbar
  • Edges of type pe_to_noc, noc_to_pe, noc_to_pe_cpu, etc. referencing the single noc node

Their role is replaced by an explicit router mesh based on cube_mesh.yaml. Each router (r0c0, r0c1, ...) from the 6x6 router grid generated by mesh_gen.py is created as a separate SimPy node in the topology graph, and adjacent routers are connected via XY mesh edges.


D3. Explicit Router Mesh (Common Basis for n:1 / 1:1)

Router Nodes Based on cube_mesh.yaml

Each non-null router from cube_mesh.yaml generated by mesh_gen.py is created as a separate SimPy node in the topology graph.

  • Node ID: {cube}.r{row}c{col} (e.g., sip0.cube0.r0c0)
  • kind: noc_router, impl: forwarding_v1
  • pos_mm: taken from cube_mesh.yaml

Based on the attach information in cube_mesh.yaml, components are connected to each router:

  • pe{p}.dma → PE_DMA ↔ router edge
  • pe{p}.cpu → PE_CPU ↔ router edge
  • pe{p}.hbm → HBM_CTRL ↔ router edge (added in n:1)
  • m_cpu → M_CPU ↔ router edge
  • sram → SRAM ↔ router edge
  • ucie_{dir}.c{i} → UCIe conn ↔ router edge
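
The attach-to-edge step can be sketched as follows. The shape of `attach_map` and the `attach_edges` helper are assumptions made for illustration (the real cube_mesh.yaml schema is not shown here), and only the `dma` and `hbm` attach kinds are handled:

```python
from typing import Dict, List, Tuple


def attach_edges(cube: str, attach_map: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return bidirectional (src, dst) edges between routers and attached nodes.

    attach_map maps a router name such as "r0c0" to attach labels such as
    "pe0.dma" or "pe0.hbm" (hypothetical layout).
    """
    edges: List[Tuple[str, str]] = []
    for router, labels in attach_map.items():
        router_id = f"{cube}.{router}"
        for label in labels:
            pe, _, kind = label.partition(".")
            if kind == "dma":
                node = f"{cube}.{pe}.pe_dma"
            elif kind == "hbm":
                # Every pe{p}.hbm attach points at the single hbm_ctrl node.
                node = f"{cube}.hbm_ctrl"
            else:
                continue  # other attach kinds are omitted in this sketch
            edges.append((node, router_id))
            edges.append((router_id, node))
    return edges
```

Because all `pe{p}.hbm` labels resolve to the same `hbm_ctrl` node, the controller ends up with one edge pair per PE router.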

Router-to-router XY mesh edges: bidirectional edges between adjacent routers. Null routers (HBM exclusion zones) are skipped.
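
A small sketch of the mesh edge generation, assuming the grid is given as a boolean matrix where `False` marks a null router. The `mesh_edges` helper and the grid representation are hypothetical:

```python
from typing import List, Tuple


def mesh_edges(cube: str, grid: List[List[bool]]) -> List[Tuple[str, str]]:
    """Return bidirectional edges between adjacent non-null routers."""
    rows, cols = len(grid), len(grid[0])
    edges: List[Tuple[str, str]] = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c]:
                continue  # null router: no node, no edges
            # Look only east and south so each link is visited once.
            for dr, dc in ((0, 1), (1, 0)):
                nr, nc = r + dr, c + dc
                if nr < rows and nc < cols and grid[nr][nc]:
                    a = f"{cube}.r{r}c{c}"
                    b = f"{cube}.r{nr}c{nc}"
                    edges.append((a, b))
                    edges.append((b, a))
    return edges
```

A full 2x2 grid has 4 links (8 directed edges); marking one corner as null leaves 2 links (4 directed edges).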

1:1 Mode Extension (To Be Implemented Later)

In 1:1 mode, each router is split into N channel mini-routers. This requires per-channel routing and a ChannelSplitter (LA → per-channel PA). N GEMM engines per PE are also added at this point.


D4. Cross-PE HBM Access (n:1 Mode)

In n:1 mode, when a PE accesses another PE's local HBM, it hops through the XY mesh in cube_mesh.yaml to reach the target PE's router.

Example: PE0 (r0c0) accessing PE2's (r1c4) HBM:

PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl

The Dijkstra router finds the shortest path in the mesh.
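
To illustrate the hop count for this example, here is a minimal Dijkstra sketch over a 6x6 grid with unit link costs. It assumes a full grid (null routers are ignored) and uses a hypothetical `shortest_path` helper rather than the project's router:

```python
import heapq
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (row, col)


def shortest_path(src: Coord, dst: Coord, rows: int = 6, cols: int = 6) -> List[Coord]:
    """Dijkstra over a full rows x cols mesh with unit edge cost."""
    dist: Dict[Coord, int] = {src: 0}
    prev: Dict[Coord, Coord] = {}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist[node]:
            continue  # stale heap entry
        r, c = node
        for nxt in ((r, c + 1), (r, c - 1), (r + 1, c), (r - 1, c)):
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols:
                nd = d + 1
                if nd < dist.get(nxt, float("inf")):
                    dist[nxt] = nd
                    prev[nxt] = node
                    heapq.heappush(heap, (nd, nxt))
    # Walk predecessors back from dst to src.
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]
```

For r0c0 to r1c4 this yields a 5-hop router path, the same length as the path shown above. Several 5-hop paths exist, so which one is returned depends on tie-breaking.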

Cross-PE channel access in 1:1 mode will be defined during the 1:1 extension in D3.


D5. n:1 Mode: Uses cube_mesh.yaml Router Mesh

In n:1 mode, no separate "aggregated router" is created. The existing router grid from cube_mesh.yaml serves that role.

Connection Structure

PE_DMA, PE_CPU, and HBM are all connected to the router where each PE is attached:

sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
sip0.cube0.hbm_ctrl   ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)

Routers are connected via XY mesh edges. PE's local HBM access goes directly from its own router (switching overhead only).

n:1 Mode Full Data Paths

Local HBM (0 hops):

PE0.pe_dma → r0c0 → hbm_ctrl  (switching overhead only)

Remote HBM (mesh hops):

PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl

M_CPU DMA:

M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl

D6. All Traffic Is Unified onto the Same Router Mesh

  • All memory accesses (DMA data) and commands (PE_CPU) use the same router mesh
  • Local access does not use a separate fast path (xbar)
  • Cross-cube (remote) access path:
PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT}
  → [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl

UCIe connections maintain the existing structure, but both endpoints become mesh routers instead of xbars.

The number of UCIe lines is determined by BW ratio: ucie_lines_per_side = ceil(ucie_bw / noc_line_bw).
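
As a worked example of this formula, the sketch below computes the line count for a few bandwidth values. The function name and the numbers in the usage note are illustrative assumptions, not values from the topology configuration:

```python
import math


def ucie_lines_per_side(ucie_bw_gbs: float, noc_line_bw_gbs: float) -> int:
    """ceil(ucie_bw / noc_line_bw), as stated in D6."""
    if noc_line_bw_gbs <= 0:
        raise ValueError("noc_line_bw_gbs must be positive")
    return math.ceil(ucie_bw_gbs / noc_line_bw_gbs)
```

For instance, a 512 GB/s UCIe side over 256 GB/s NOC lines needs 2 lines, while 600 GB/s needs 3, since a partial line still requires a full one.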


D7. AddressResolver Changes

Current AddressResolver.resolve():

# Current: HBM offset → pe_slice → "sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
pe_slice = PhysAddr.hbm_pe_id(addr.hbm_offset, self._slice_size_bytes)
return f"sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"

After change:

# Changed: HBM → single endpoint
return f"sip{s}.cube{c}.hbm_ctrl"

The pe_slice calculation is removed. In n:1 mode, PE_DMA directly accesses the hbm_ctrl attached to its own router.

resolver.resolve() is retained for external access (M_CPU DMA, etc.) and backward compatibility.
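
The effect of the change can be illustrated with two stand-in functions. `resolve_old` and `resolve_new` are hypothetical, and the integer division is only an assumed stand-in for `PhysAddr.hbm_pe_id`:

```python
def resolve_old(sip: int, cube: int, hbm_offset: int, slice_size_bytes: int) -> str:
    """Old behavior: the HBM offset selects one of the per-PE slice nodes."""
    pe_slice = hbm_offset // slice_size_bytes  # assumed stand-in for PhysAddr.hbm_pe_id
    return f"sip{sip}.cube{cube}.hbm_ctrl.slice{pe_slice}"


def resolve_new(sip: int, cube: int, hbm_offset: int) -> str:
    """New behavior: every HBM offset resolves to the single hbm_ctrl node."""
    # hbm_offset is kept only for signature parity; it no longer affects the result.
    return f"sip{sip}.cube{cube}.hbm_ctrl"
```

With a 6 GiB slice size, an offset of 7 GiB previously resolved to `slice1`; after the change, all offsets in the cube resolve to the same endpoint.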


D8. topology.yaml Configuration Changes

Added Settings

cube:
  memory_map:
    hbm_mapping_mode: n_to_one          # one_to_one | n_to_one
    hbm_pseudo_channels: 64             # total pseudo channel count
    hbm_channels_per_pe: 8              # local channels per PE (= pseudo_channels / pes_per_cube)
    hbm_channel_bw_gbs: 32.0            # per-channel bandwidth (GB/s)
    hbm_total_gb_per_cube: 48           # retained

Removed Settings

# To be removed
links:
  xbar_to_hbm_bw_gbs: 256.0            # → replaced by hbm_channel_bw_gbs × hbm_channels_per_pe
  xbar_to_hbm_mm: 2.5                  # → replaced by ch_router_to_hbm_mm
  xbar_to_bridge_bw_gbs: 128.0         # → removed (no bridge)
  xbar_to_bridge_mm: 3.0               # → removed
  noc_to_xbar_bw_gbs: ...              # → removed
  noc_to_xbar_mm: ...                  # → removed

# To be added
links:
  router_link_bw_gbs: 256.0            # XY mesh link BW between routers
  router_overhead_ns: 2.0              # router switching overhead
  pe_to_router_bw_gbs: 256.0           # PE_DMA ↔ router
  hbm_to_router_bw_gbs: 256.0          # HBM ↔ router (= N × channel_bw)
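
A loader could check these settings for internal consistency. The sketch below assumes the YAML has already been parsed into dictionaries. `check_memory_map` is a hypothetical helper, and `pes_per_cube` is passed separately because it does not appear among the settings shown here:

```python
import math
from typing import Any, Dict

VALID_MODES = {"one_to_one", "n_to_one"}


def check_memory_map(memory_map: Dict[str, Any], links: Dict[str, Any], pes_per_cube: int) -> None:
    """Raise ValueError if the HBM settings contradict each other."""
    mode = memory_map["hbm_mapping_mode"]
    if mode not in VALID_MODES:
        raise ValueError(f"unknown hbm_mapping_mode: {mode}")
    total = memory_map["hbm_pseudo_channels"]
    per_pe = memory_map["hbm_channels_per_pe"]
    if per_pe * pes_per_cube != total:
        raise ValueError("hbm_channels_per_pe must equal hbm_pseudo_channels / pes_per_cube")
    # HBM <-> router link is expected to carry N x channel_bw.
    expected_bw = per_pe * memory_map["hbm_channel_bw_gbs"]
    if not math.isclose(links["hbm_to_router_bw_gbs"], expected_bw):
        raise ValueError("hbm_to_router_bw_gbs must equal N x hbm_channel_bw_gbs")
```

The values listed above (64 channels, 8 per PE, 32.0 GB/s per channel, 256.0 GB/s HBM link) pass this check when `pes_per_cube` is 8.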

D9. Bandwidth Numerical Consistency

| Configuration | Value |
|---|---|
| pseudo channels per cube | 64 (parameter) |
| PEs per cube | 8 (parameter) |
| channels per PE (N) | pseudo_channels / pes_per_cube = 8 |
| per-channel BW | 32 GB/s (parameter) |
| per-PE local BW | N × 32 = 256 GB/s |
| cube total HBM BW | 64 × 32 = 2048 GB/s |

The effective BW per PE is identical in both modes:

  • 1:1 mode: N channel links × channel_bw_gbs = N × 32 = 256 GB/s
  • n:1 mode: 1 aggregated link = N × channel_bw_gbs = 256 GB/s
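
The arithmetic above can be reproduced with a short sketch. The function and key names are hypothetical, and the defaults are the parameter values from the table:

```python
from typing import Dict


def bandwidth_summary(
    pseudo_channels: int = 64,
    pes_per_cube: int = 8,
    channel_bw_gbs: float = 32.0,
) -> Dict[str, float]:
    """Derive N and the per-PE / per-cube HBM bandwidth figures."""
    n, rem = divmod(pseudo_channels, pes_per_cube)
    if rem:
        raise ValueError("pseudo_channels must be divisible by pes_per_cube")
    return {
        "channels_per_pe": n,
        # 1:1 mode: N separate channel links, each at channel_bw_gbs.
        "per_pe_bw_one_to_one": sum(channel_bw_gbs for _ in range(n)),
        # n:1 mode: one aggregated link at N x channel_bw_gbs.
        "per_pe_bw_n_to_one": n * channel_bw_gbs,
        "cube_total_bw": pseudo_channels * channel_bw_gbs,
    }
```

With the default parameters, both per-PE figures come out to 256 GB/s and the cube total to 2048 GB/s, matching the table.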

Consequences

Positive

  • The router mesh based on cube_mesh.yaml accurately reflects physical placement
  • In n:1 mode, the existing VA scheme is preserved, keeping transition costs low
  • Local, remote, and command traffic are unified onto the same mesh, which simplifies the model
  • Aligns well with graph compiler-based topology generation
  • Channel count and PE count are both parameterized, enabling testing of various configurations
  • 1:1 mode extension naturally follows through router differentiation

Negative

  • The number of SimPy nodes increases due to explicit router nodes (6x6 grid, up to 32 non-null routers/cube)
  • Requires complete rewrite of existing xbar/bridge/single NOC-based tests
  • The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model

Alternatives

A1. Retain Existing xbar + HBM Slices

  • Local/remote paths remain bifurcated
  • Cannot model at pseudo-channel granularity
  • Cannot switch between 1:1/n:1 modes

A2. Always Build the 1:1 Topology (n:1 via Link Aggregation)

  • Topology structure always has 1:1 size
  • Expressing n:1 semantics via link aggregation is complex
  • No reduction in router node count

A3. Gradual Transition (Retain xbar + Add NOC Path)

  • Higher compatibility, but dual-path coexistence increases complexity
  • Since xbar removal is ultimately necessary, the intermediate step provides little value

Implementation Notes

topology/builder.py Change Details

Code to Remove (within current _instantiate_cube())

  • xbar_top, xbar_bot node creation (~line 495-508)
  • bridge.left, bridge.right node creation
  • noc ↔ xbar edge creation (~line 540-555)
  • xbar ↔ hbm_ctrl.slice edge creation (~line 510-538)
  • xbar ↔ bridge edge creation (~line 557-572)

Code to Add

1:1 mode:

N = hbm_channels_per_pe  # from topology config
total_ch = hbm_pseudo_channels

# Create channel router nodes
for ch_id in range(total_ch):
    pe_id = ch_id // N
    nodes[f"{cp}.ch_r{ch_id}"] = Node(
        id=f"{cp}.ch_r{ch_id}", kind="noc_router", impl="noc_v1",
        attrs={}, pos_mm=(...),  # horizontal row = ch_id % N
    )

# PE_DMA ↔ local channel router edges
for pe_id in range(pes_per_cube):
    for local_ch in range(N):
        ch_id = pe_id * N + local_ch
        edges.append(Edge(
            src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.ch_r{ch_id}",
            bw_gbs=channel_bw, kind="pe_to_ch_router", ...))
        edges.append(Edge(
            src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.pe{pe_id}.pe_dma",
            bw_gbs=channel_bw, kind="ch_router_to_pe", ...))

# Channel router ↔ hbm_ctrl edges
for ch_id in range(total_ch):
    edges.append(Edge(
        src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.hbm_ctrl",
        bw_gbs=channel_bw, kind="ch_router_to_hbm", ...))
    edges.append(Edge(
        src=f"{cp}.hbm_ctrl", dst=f"{cp}.ch_r{ch_id}",
        bw_gbs=channel_bw, kind="hbm_to_ch_router", ...))

# Horizontal line edges (same logical index)
for row in range(N):
    for p in range(pes_per_cube - 1):
        ch_a = p * N + row
        ch_b = (p + 1) * N + row
        edges.append(Edge(
            src=f"{cp}.ch_r{ch_a}", dst=f"{cp}.ch_r{ch_b}",
            bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...))
        edges.append(Edge(
            src=f"{cp}.ch_r{ch_b}", dst=f"{cp}.ch_r{ch_a}",
            bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...))

n:1 mode:

# Create aggregated router nodes
for pe_id in range(pes_per_cube):
    nodes[f"{cp}.pe{pe_id}.agg_router"] = Node(
        id=f"{cp}.pe{pe_id}.agg_router", kind="noc_router", impl="noc_v1",
        attrs={}, pos_mm=(...),
    )

agg_bw = N * channel_bw  # aggregated BW

# PE_DMA ↔ aggregated router
for pe_id in range(pes_per_cube):
    edges.append(Edge(
        src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.pe{pe_id}.agg_router",
        bw_gbs=agg_bw, kind="pe_to_agg_router", ...))
    edges.append(Edge(
        src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.pe{pe_id}.pe_dma",
        bw_gbs=agg_bw, kind="agg_router_to_pe", ...))

# Aggregated router ↔ hbm_ctrl
for pe_id in range(pes_per_cube):
    edges.append(Edge(
        src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.hbm_ctrl",
        bw_gbs=agg_bw, kind="agg_to_hbm", ...))
    edges.append(Edge(
        src=f"{cp}.hbm_ctrl", dst=f"{cp}.pe{pe_id}.agg_router",
        bw_gbs=agg_bw, kind="hbm_to_agg", ...))

# Horizontal links between aggregated routers
for p in range(pes_per_cube - 1):
    edges.append(Edge(
        src=f"{cp}.pe{p}.agg_router", dst=f"{cp}.pe{p+1}.agg_router",
        bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...))
    edges.append(Edge(
        src=f"{cp}.pe{p+1}.agg_router", dst=f"{cp}.pe{p}.agg_router",
        bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...))

Affected Existing Tests

| Test File | Impact |
|---|---|
| tests/test_topology_compile.py | Remove xbar/bridge node references, add channel router verification |
| tests/test_topology_load.py | Reflect topology.yaml configuration changes |
| tests/test_pe_components.py | PE_DMA routing path changes |
| tests/test_sip_parallel.py | Cross-PE access path changes |
| Cases that directly test xbar/bridge | Remove |

Test Requirements

  • Verify that requests are delivered via per-channel links in 1:1 mode
  • Verify that requests are delivered via the aggregated link in n:1 mode
  • Verify that topology is correctly generated in both modes:
    • 1:1: total_ch channel routers + per-PE links + horizontal links
    • n:1: pes_per_cube aggregated routers + per-PE links
  • Verify that effective BW is consistent across both modes for the same workload
  • Verify that horizontal line routing works for cross-PE access
  • Verify that routing through UCIe works for cross-cube access
  • Verify that topology generation is correct under parameter variations (channels_per_pe = 4, 8, 16, etc.)
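
For the parameter-variation tests, expected node and edge counts can be derived from the loops in the builder snippets under Implementation Notes. The `expected_counts` helper below is a hypothetical sketch that mirrors those loops. It is not part of the existing test suite:

```python
from typing import Dict


def expected_counts(mode: str, channels_per_pe: int, pes_per_cube: int = 8) -> Dict[str, int]:
    """Expected router and directed-edge counts per cube, per the builder snippets."""
    n = channels_per_pe
    if mode == "one_to_one":
        total_ch = n * pes_per_cube
        return {
            "routers": total_ch,                        # one channel router per pseudo channel
            "pe_edges": 2 * total_ch,                   # PE_DMA <-> channel router
            "hbm_edges": 2 * total_ch,                  # channel router <-> hbm_ctrl
            "horizontal_edges": 2 * n * (pes_per_cube - 1),
        }
    if mode == "n_to_one":
        return {
            "routers": pes_per_cube,                    # one aggregated router per PE
            "pe_edges": 2 * pes_per_cube,
            "hbm_edges": 2 * pes_per_cube,
            "horizontal_edges": 2 * (pes_per_cube - 1),
        }
    raise ValueError(f"unknown mode: {mode}")
```

A parametrized test could compare these counts against the generated topology for channels_per_pe = 4, 8, and 16.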

Related ADRs

  • ADR-0018 (LA + BAAW) → addressing-side integration
  • ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
  • ADR-0004 (Memory Semantics) → BW model redefinition
  • ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes