
# ADR-0017: Cube NOC and HBM Connectivity

## Status

Accepted

## Context

The CUBE-level NOC is a 2D router mesh that carries every intra-cube
request: PE-to-HBM data, PE-to-PE traffic, command paths
(M_CPU↔PE_CPU), shared SRAM access, and inter-cube UCIe traffic.

The CUBE's HBM is exposed through per-PE controller endpoints attached
to PE routers. This per-PE partitioning makes local-vs-remote HBM
distinguishable by mesh distance: a PE's own HBM partition sits at its
own router (switching overhead only); another PE's HBM partition is
reachable by mesh hops to that PE's router.

Two channel-mapping modes are supported in the design space:

- **n:1 (default, implemented)** — each PE's HBM partition aggregates
  `channels_per_pe` pseudo-channels into one endpoint. Effective
  per-PE BW = N × per-channel BW.
- **1:1 (future)** — each PE router decomposes into per-channel
  mini-routers; per-channel BW contention is modeled directly.

In both modes the per-PE effective BW is identical; only the connectivity
granularity differs.

## Decision

### D1. 2D router mesh

Each cube contains a 2D mesh of NOC routers generated by `mesh_gen.py`.

- Node naming: `sip{S}.cube{C}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`).
- Implementation: `forwarding_v1`. NOC `overhead_ns = 0`.
- Default 6×6 grid (sized from PE corner placement + UCIe attachment
  count); larger PE counts scale the grid up.
- HBM exclusion zone: center rows/columns are excluded where HBM die
  physically occupies space (e.g., r2c2, r2c3, r3c2, r3c3 for a 6×6).
- Uncontended latency = Manhattan distance (mm) × `ns_per_mm`.
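
The grid and latency rules above can be sketched as follows. This is a minimal illustration, not the `mesh_gen.py` implementation: the function names, the inclusive `hbm_zone` tuple format, and the sample `ns_per_mm` value are assumptions made for the example.

```python
def router_names(sip, cube, rows=6, cols=6, hbm_zone=((2, 3), (2, 3))):
    """Return router node names for a rows x cols grid, skipping the HBM zone.

    hbm_zone is ((row_min, row_max), (col_min, col_max)), inclusive.
    This tuple layout is a hypothetical convention for this sketch.
    """
    (r_lo, r_hi), (c_lo, c_hi) = hbm_zone
    names = []
    for row in range(rows):
        for col in range(cols):
            if r_lo <= row <= r_hi and c_lo <= col <= c_hi:
                continue  # position occupied by the HBM die: no router here
            names.append(f"sip{sip}.cube{cube}.r{row}c{col}")
    return names


def uncontended_latency_ns(manhattan_mm, ns_per_mm):
    """Latency = Manhattan distance x ns_per_mm, with no contention."""
    return manhattan_mm * ns_per_mm


routers = router_names(0, 0)
print(len(routers))                       # 36 grid positions minus 4 excluded
print(uncontended_latency_ns(5.0, 2.0))   # 2.0 ns/mm is an illustrative value
```

For the default 6×6 grid this leaves 32 routers, with `r2c2`, `r2c3`, `r3c2`, and `r3c3` absent.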

### D2. XY routing algorithm

Deterministic XY routing:

1. Horizontal segment: travel along the source row (source Y) from source X
   to destination X.
2. Vertical segment: travel along the destination column (destination X)
   from source Y to destination Y.

Each directed segment carries a unique key:

- Horizontal: `("H", y_band, x_min, x_max, direction)`
- Vertical:   `("V", x_band, y_min, y_max, direction)`

Grid positions are snapped to the router grid, excluding the HBM zone.
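
A minimal sketch of how the two segment keys could be derived for one transfer. The function name, the integer grid coordinates, and the `"+"`/`"-"` direction labels are assumptions for illustration; the ADR does not specify how `direction` is encoded.

```python
def xy_segment_keys(src, dst):
    """Return the XY segment keys for a transfer from src to dst.

    src and dst are (x, y) grid positions already snapped to the router grid.
    Direction labels "+" / "-" are hypothetical placeholders.
    """
    (sx, sy), (dx, dy) = src, dst
    keys = []
    if sx != dx:
        # Horizontal leg runs along the source row (y band = source Y).
        direction = "+" if dx > sx else "-"
        keys.append(("H", sy, min(sx, dx), max(sx, dx), direction))
    if sy != dy:
        # Vertical leg runs along the destination column (x band = dest X).
        direction = "+" if dy > sy else "-"
        keys.append(("V", dx, min(sy, dy), max(sy, dy), direction))
    return keys


print(xy_segment_keys((0, 0), (4, 1)))
```

A transfer whose source and destination share a row or a column yields a single key, and a transfer to the same position yields none.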

### D3. Per-segment contention model

Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
sharing a segment (same row or column band, same direction) contend for
the resource — modelling link-level serialization in a wormhole-routed
mesh.

With no contention, NOC traversal latency equals Manhattan distance ×
`ns_per_mm`. Under contention, SimPy's resource scheduling adds queueing
delay.
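
The queueing effect can be illustrated without SimPy by working out the arithmetic of a single capacity-1 segment served in arrival order. This is a sketch of the serialization behaviour only: the function name and the assumption that each transaction holds the segment for a fixed `hold_ns` are not taken from the simulator.

```python
def serialize_on_segment(requests):
    """Serve requests on one capacity-1 segment in arrival (FIFO) order.

    requests: list of (arrival_ns, hold_ns) tuples.
    Returns a list of (start_ns, finish_ns, queue_delay_ns), sorted by arrival.
    """
    results = []
    free_at = 0.0  # time at which the segment next becomes available
    for arrival, hold in sorted(requests, key=lambda r: r[0]):
        start = max(arrival, free_at)   # wait if the segment is still held
        finish = start + hold
        results.append((start, finish, start - arrival))
        free_at = finish
    return results


# Two transactions sharing a segment: the second arrives while it is busy.
for start, finish, delay in serialize_on_segment([(0.0, 10.0), (4.0, 10.0)]):
    print(start, finish, delay)
```

The first transaction sees no delay; the second waits 6 ns for the segment to be released, which is the kind of queueing delay the resource model adds on top of the distance-based latency.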

### D4. NOC attachment points (per-PE HBM partition)

Every PE router carries three attachments: `pe{idx}.dma`, `pe{idx}.cpu`,
and `pe{idx}.hbm`. The last is the per-PE HBM controller endpoint —
`sip{S}.cube{C}.hbm_ctrl.pe{idx}` — which owns one slice of the cube's
HBM (one pseudo-channel group; see D8).

Other attachments:

- M_CPU and shared SRAM each occupy a dedicated edge router.
- UCIe endpoints (N/S/E/W) each expose 4 connection routers distributed
  along that edge (see D6).

```text
                    UCIe-N (conn x4)
                         |
           +---------+---+---+---------+
           |         |       |         |
PE0.dma ---+  r0c0   |  ...  |  r0c5  +--- PE2.dma
PE0.cpu <--+ +hbm.pe0|       | +hbm.pe2+--< PE2.cpu
           |         |       |         |
UCIe-W ----+  ...    | [HBM] |  ...   +---- UCIe-E
(conn x4)  |         | zone  |         |  (conn x4)
           |  r2c0   |       |         |
M_CPU <--->+         |       |         |
           |  r3c0   |       |         |
SRAM <---->+         |       |         |
           |         |       |         |
PE4.dma ---+  r4c0   |  ...  |  r4c5  +--- PE6.dma
PE4.cpu <--+ +hbm.pe4|       | +hbm.pe6+--< PE6.cpu
           |         |       |         |
           +---------+---+---+---------+
                         |
                    UCIe-S (conn x4)
```

Per-PE HBM partitioning is the key invariant that makes local vs
cross-PE HBM distinguishable by mesh distance (see D7).

### D5. NOC edge bandwidths and distances

| Connection                    | BW (GB/s)  | Distance      | Notes                                       |
| ----------------------------- | ---------- | ------------- | ------------------------------------------- |
| PE_DMA → NOC                  | 256.0      | Physical (PE) | Matches local-HBM aggregate BW              |
| NOC → PE_CPU                  | —          | 0.0 mm        | Command path only                           |
| Router ↔ hbm_ctrl.pe{idx}     | 256.0      | 0.0 mm        | Per PE router; N × per-channel BW (see D8)  |
| NOC ↔ M_CPU                   | —          | 0.0 mm        | Command path                                |
| NOC ↔ SRAM                    | 128.0 × 4  | 0.0 mm        | 512 GB/s aggregate                          |
| NOC ↔ UCIe conn               | 128.0      | 0.0 mm        | Per connection; 4 conn per port             |

A `0.0 mm` distance means the endpoint attaches directly at its router, so
the attachment link itself adds no wire length; actual traversal distance
is computed as Manhattan distance within the router grid.

### D6. UCIe decomposition and inter-cube traffic

Each of the 4 UCIe ports (N, S, E, W) decomposes into:

- 1 `ucie-{PORT}` node: UCIe protocol endpoint (`overhead = 8.0 ns`).
- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe.

This decomposition gives 4 independent NOC↔UCIe connections per port,
each with 128 GB/s bandwidth (512 GB/s aggregate per port).

Inter-cube traffic path:

```text
Source: PE_DMA → NOC → conn{i} → ucie-{PORT}
                  [UCIe link: 512 GB/s, 1.0mm seam distance]
Target: ucie-{PORT} → conn{i} → r{x}c{y} → (mesh hops) → hbm_ctrl.pe{idx}
```

UCIe overhead (8.0 ns) is applied at each `ucie-{PORT}` node, so a full
crossing incurs 16 ns of UCIe overhead (8 ns at the TX port plus 8 ns at
the RX port).
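
The per-port numbers above follow from simple arithmetic, sketched here. The constant and function names are hypothetical; only the values (8.0 ns, 128 GB/s, 4 connections) come from this ADR.

```python
UCIE_OVERHEAD_NS = 8.0   # applied at each ucie-{PORT} node
CONN_BW_GBS = 128.0      # per NOC<->UCIe connection
CONNS_PER_PORT = 4


def ucie_conn_names(port):
    """Connection bridge node names for one UCIe port (N, S, E, or W)."""
    return [f"ucie-{port}.conn{i}" for i in range(CONNS_PER_PORT)]


def port_aggregate_bw_gbs():
    """Aggregate NOC<->UCIe bandwidth across all connections of one port."""
    return CONN_BW_GBS * CONNS_PER_PORT


def crossing_overhead_ns(ports_traversed=2):
    """UCIe overhead for a crossing: one TX port plus one RX port by default."""
    return UCIE_OVERHEAD_NS * ports_traversed


print(ucie_conn_names("N"))
print(port_aggregate_bw_gbs(), crossing_overhead_ns())
```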

### D7. Data paths through the NOC

All intra-cube traffic uses the same router mesh — no separate fast
paths.

**Local HBM** (same PE's own partition; 0 mesh hops):

```text
PE_DMA → r{x}c{y} → hbm_ctrl.pe{idx}   (switching overhead only)
```

**Cross-PE HBM within cube** (target PE's partition, reached by mesh):

```text
PE_DMA → r{x}c{y} → (mesh hops) → r{x'}c{y'} → hbm_ctrl.pe{idx'}
```

Example: PE0 (on `r0c0`) accessing PE2's HBM (PE2 on `r1c4`):

```text
PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl.pe2
```

Dijkstra computes the shortest path within the mesh.
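
A minimal sketch of a shortest-path search over the router grid, assuming unit-cost links between adjacent routers and the default 6×6 HBM exclusion zone. The function name and the `(row, col)` coordinate convention are assumptions; the real graph and edge weights come from the generated mesh. Several equal-length paths can exist, so the route returned here may differ from the example above while having the same hop count.

```python
import heapq


def mesh_shortest_path(src, dst, rows=6, cols=6,
                       blocked=frozenset({(2, 2), (2, 3), (3, 2), (3, 3)})):
    """Dijkstra over a rows x cols router grid with unit-cost links.

    src and dst are (row, col); blocked positions (HBM zone) hold no router.
    Returns the path as router labels, or None if dst is unreachable.
    """
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist[node]:
            continue  # stale heap entry
        r, c = node
        for nxt in ((r, c + 1), (r, c - 1), (r + 1, c), (r - 1, c)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols) or nxt in blocked:
                continue
            nd = d + 1
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return [f"r{r}c{c}" for r, c in reversed(path)]


print(mesh_shortest_path((0, 0), (1, 4)))  # 5 hops between r0c0 and r1c4
```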

**Cross-cube HBM** (UCIe traversal):

```text
PE_DMA → r{x}c{y} → conn → ucie-{PORT} → [seam] → ucie-{PORT'} → conn
       → r{x'}c{y'} → hbm_ctrl.pe{idx'}
```

**Kernel launch command to PE**:

```text
[from io_noc] → ucie → conn → r{x}c{y} → (mesh) → M_CPU → (mesh) → PE_CPU
```

**Shared SRAM access**:

```text
PE_DMA → r{x}c{y} → (mesh) → SRAM
```

### D8. HBM channel mapping mode

Channel mapping is configured at cube scope:

```yaml
cube:
  memory_map:
    hbm_mapping_mode: n_to_one       # one_to_one | n_to_one
    hbm_pseudo_channels: 64          # total pseudo-channel count
    hbm_channels_per_pe: 8           # per-PE local channel count
    hbm_channel_bw_gbs: 32.0         # per-channel bandwidth (GB/s)
    hbm_slices_per_cube: 8           # number of per-PE partitions
    hbm_total_gb_per_cube: 48
```

**n:1 mode (default, implemented).** Each PE's HBM partition is a single
endpoint `hbm_ctrl.pe{idx}` that aggregates `channels_per_pe`
pseudo-channels (`hbm_channels_per_pe` in the config above). The
`Router ↔ hbm_ctrl.pe{idx}` link bandwidth equals
`channels_per_pe × hbm_channel_bw_gbs`. Pseudo-channels are assumed to
interleave; only aggregate per-PE BW is modeled. No separate aggregated
router node exists; the per-PE router itself serves that role.

**1:1 mode (future).** Each PE router decomposes into N channel
mini-routers; per-channel routing carries fully-resolved PA + channel ID.
A `ChannelSplitter` resolves a logical access into N per-channel physical
requests. Each per-channel link models BW contention. Cross-PE channel
access semantics are deferred to the implementation ADR.

**BW math (defaults).**

| Parameter                          | Value                      |
| ---------------------------------- | -------------------------- |
| pseudo channels per cube           | 64 (parameter)             |
| PEs per cube                       | 8 (parameter)              |
| channels per PE (N)                | 64 / 8 = 8                 |
| per-channel BW                     | 32 GB/s (parameter)        |
| per-PE local BW                    | N × 32 = 256 GB/s          |
| cube total HBM BW                  | 64 × 32 = 2048 GB/s        |

Both modes give the same per-PE effective BW; only the request shape and
contention model differ.
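
The table values can be reproduced from the `memory_map` defaults with a short calculation. This is a sketch under stated assumptions: the function name and the consistency check (channels per PE × partitions = total pseudo-channels) are illustrative and not part of the simulator's config loader.

```python
memory_map = {
    "hbm_pseudo_channels": 64,
    "hbm_channels_per_pe": 8,
    "hbm_channel_bw_gbs": 32.0,
    "hbm_slices_per_cube": 8,
}


def derive_hbm_bandwidth(cfg):
    """Derive per-PE and cube-total HBM bandwidth (GB/s) from the config."""
    n = cfg["hbm_channels_per_pe"]
    # Assumed sanity check: every pseudo-channel belongs to exactly one PE.
    if n * cfg["hbm_slices_per_cube"] != cfg["hbm_pseudo_channels"]:
        raise ValueError("channels per PE x partitions != total pseudo-channels")
    return {
        "per_pe_bw_gbs": n * cfg["hbm_channel_bw_gbs"],
        "cube_total_bw_gbs": cfg["hbm_pseudo_channels"] * cfg["hbm_channel_bw_gbs"],
    }


print(derive_hbm_bandwidth(memory_map))
```

With the defaults this yields 256 GB/s per PE, which matches the `Router ↔ hbm_ctrl.pe{idx}` and `PE_DMA → NOC` entries in D5, and 2048 GB/s for the cube.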

### D9. AddressResolver — per-PE HBM endpoint

The address resolver decodes a PA's HBM offset to the owning PE's
partition:

```python
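# policy/routing/router.py
# Bytes owned by each per-PE HBM partition (cube capacity / partition count).
hbm_slice_bytes = hbm_total_gb_per_cube * (1 << 30) // hbm_slices_per_cube

if addr.kind == "hbm":
    # Integer division maps the HBM offset to the owning PE's partition index.
    pe_id = int(addr.hbm_offset) // hbm_slice_bytes
    return f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"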
```

The `pe_id` computation is intrinsic to the routing layer (not a
topology-time concern). Any HBM PA falls within exactly one partition,
yielding deterministic routing.

External callers (e.g., M_CPU DMA, Memory R/W from PCIE_EP) follow the
same resolver path — there is no separate fast path.
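
A self-contained worked example of the partition decode, using the D8 defaults (48 GB across 8 partitions, so 6 GiB per partition). The function name, its signature, and the bounds check are assumptions added for illustration; the excerpt above is the authoritative logic.

```python
GIB = 1 << 30


def resolve_hbm_endpoint(sip, cube, hbm_offset, total_gb=48, slices=8):
    """Map an HBM offset within a cube to its per-PE controller endpoint."""
    slice_bytes = total_gb * GIB // slices
    # Assumed guard: offsets beyond the cube's capacity are rejected.
    if not 0 <= hbm_offset < total_gb * GIB:
        raise ValueError("HBM offset outside cube capacity")
    pe_id = hbm_offset // slice_bytes
    return f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe_id}"


print(resolve_hbm_endpoint(0, 0, 0))            # first byte of partition 0
print(resolve_hbm_endpoint(0, 0, 6 * GIB))      # first byte of partition 1
print(resolve_hbm_endpoint(0, 1, 48 * GIB - 1)) # last byte of the last partition
```

Because the partitions are contiguous and equal in size, every in-range offset resolves to exactly one endpoint.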

### D10. Mesh generation parameters

`mesh_gen.py` produces `cube_mesh.yaml` from:

- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner.
- `cube.geometry`: cube physical dimensions and HBM zone.
- `cube.ucie.n_connections`: determines router count for UCIe attachment.

Output `mesh_data` dictionary contains:

- Router grid with positions and HBM exclusion zones.
- PE-to-router attachments (`pe{idx}.dma`, `pe{idx}.cpu`, `pe{idx}.hbm`
  per PE).
- UCIe-to-router attachments (N/S/E/W distributed across edge routers).
- M_CPU and SRAM router attachments.

## Consequences

- Local HBM (0 mesh hops, switching overhead only) and cross-PE HBM
  (mesh hops) are naturally distinguishable, satisfying SPEC R5
  (multi-domain communication) and ADR-0002 (no zero-latency end-to-end
  paths).
- All cube-internal traffic routes through one mesh — single contention
  model, single layout, single set of edge BWs.
- Per-PE HBM partitioning maps cleanly to the LA model (ADR-0011): each
  PE's partition is the n:1 aggregate of its assigned pseudo-channels.
- 1:1 mode extension is structurally natural — split each PE router into
  N channel routers.
- Mesh generation is fully parameterised by `topology.yaml`; PE/cube
  geometry changes propagate without code edits.

## Links

- ADR-0002 (Routing distance, ordering, no zero-latency paths)
- ADR-0003 D3 (cube-level NOC definition — extended here)
- ADR-0004 (Memory semantics, local HBM)
- ADR-0011 (Memory addressing — LA model consumes per-PE partition)
- ADR-0014 D1 (PE_DMA egress via router mesh)
- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
- ADR-0016 (IOChiplet io_noc — analogous pattern at IO chiplet level)
- ADR-0033 (Latency model: per-PC parallelism, switch penalty)