Files
kernbench2/docs/adr/ADR-0019-NOC-Local HBM.en.md
T
ywkang b2c52f0e34 Add English translations for ADR-0018, 0019, 0020, 0021
- ADR-0018: LA-based memory address abstraction + BAAW + HBM channel mapping
- ADR-0019: CUBE NOC per-channel and aggregated HBM connection model
- ADR-0020: 2-pass data execution model (timing/data separation, greenlet)
- ADR-0021: PE pipeline refactor (component separation + token self-routing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 16:31:32 -07:00

432 lines
14 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC
## Status
Proposed
## Context
ADR-0018 introduced LA-based address abstraction and BAAW,
defining how a logical memory access is translated into the following two forms of requests:
- 1:1 mode: one logical access → N per-channel requests
- n:1 mode: one logical access → one aggregated request
Here N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`),
determined by topology parameters.
### Problems with the Existing Structure
In the current implementation (`topology/builder.py`):
- PE_DMA → NOC → xbar_top/xbar_bot → HBM_CTRL.slice{0-7} path is used
- HBM is modeled as 8 slice (= per-PE) nodes
- Local/remote access use different paths:
- local: NOC → xbar → HBM slice
- cross-half: NOC → xbar_top → bridge → xbar_bot → HBM slice
- remote cube: NOC → UCIe → remote NOC → remote xbar → remote HBM slice
Limitations of this structure:
- Cannot model at the pseudo-channel granularity (slice = per-PE granularity, not per-channel)
- xbar/bridge bifurcate local/remote paths
- Cannot express 1:1 / n:1 modes consistently
---
## Decision
### D1. HBM Attaches to PE Routers
Consolidate the current `hbm_ctrl.slice{0-7}` (8 nodes) into a **single `hbm_ctrl` node**,
and attach the HBM access point to the same router where the PE is attached.
- n:1 mode: PE's local HBM access goes directly from its own router (switching overhead only, 0 hops)
- Remote PE's HBM access: reaches the target PE's router via mesh hops
- The read/write resource model within the HBM controller is preserved
Node naming changes:
| Current | After Change |
| ---- | ------- |
| `sip0.cube0.hbm_ctrl.slice0` ~ `slice7` | `sip0.cube0.hbm_ctrl` (single) |
In `mesh_gen.py`, add `pe{idx}.hbm` to the PE attachment so that
the builder generates an edge between that router and hbm_ctrl.
---
### D2. Complete Removal of xbar, bridge, and Single NOC Node
Remove all of the following nodes and related edges:
- `{cube}.xbar_top`, `{cube}.xbar_bot`
- `{cube}.bridge.left`, `{cube}.bridge.right`
- `{cube}.noc` (single TwoDMeshNocComponent node)
- Edges of type `noc_to_xbar`, `xbar_to_noc`, `xbar_to_hbm`, `hbm_to_xbar`
- Edges of type `xbar_to_bridge`, `bridge_to_xbar`
- Edges of type `pe_to_noc`, `noc_to_pe`, `noc_to_pe_cpu`, etc. referencing the single noc node
Their role is replaced by an **explicit router mesh based on cube_mesh.yaml**.
Each router (r0c0, r0c1, ...) from the 6x6 router grid generated by `mesh_gen.py`
is created as a separate SimPy node in the topology graph,
and adjacent routers are connected via XY mesh edges.
---
### D3. Explicit Router Mesh (Common Basis for n:1 / 1:1)
#### Router Nodes Based on cube_mesh.yaml
Each non-null router from cube_mesh.yaml generated by `mesh_gen.py`
is created as a **separate SimPy node** in the topology graph.
- Node ID: `{cube}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`)
- kind: `noc_router`, impl: `forwarding_v1`
- pos_mm: taken from cube_mesh.yaml
Based on the attach information in cube_mesh.yaml, components are connected to each router:
- `pe{p}.dma` → PE_DMA ↔ router edge
- `pe{p}.cpu` → PE_CPU ↔ router edge
- `pe{p}.hbm` → HBM_CTRL ↔ router edge (added in n:1)
- `m_cpu` → M_CPU ↔ router edge
- `sram` → SRAM ↔ router edge
- `ucie_{dir}.c{i}` → UCIe conn ↔ router edge
Router-to-router XY mesh edges: bidirectional edges between adjacent routers.
Null routers (HBM exclusion zones) are skipped.
#### 1:1 Mode Extension (To Be Implemented Later)
In 1:1 mode, each router differentiates into N channel mini-routers.
Per-channel routing and ChannelSplitter (LA → per-channel PA) introduction are required.
N GEMM engines per PE are also added at this point.
---
### D4. Cross-PE HBM Access (n:1 Mode)
In n:1 mode, when a PE accesses another PE's local HBM,
it hops through the XY mesh in cube_mesh.yaml to reach the target PE's router.
Example: PE0 (r0c0) accessing PE2's (r1c4) HBM:
```text
PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl
```
The Dijkstra router finds the shortest path in the mesh.
Cross-PE channel access in 1:1 mode will be defined during the 1:1 extension in D3.
---
### D5. n:1 Mode: Uses cube_mesh.yaml Router Mesh
In n:1 mode, no separate "aggregated router" is created.
The existing router grid from cube_mesh.yaml serves that role.
#### Connection Structure
PE_DMA, PE_CPU, and HBM are all connected to the router where each PE is attached:
```text
sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0 (bw: N × channel_bw_gbs)
sip0.cube0.hbm_ctrl ←→ sip0.cube0.r0c0 (bw: N × channel_bw_gbs)
```
Routers are connected via XY mesh edges. PE's local HBM access goes
directly from its own router (switching overhead only).
#### n:1 Mode Full Data Paths
**Local HBM (0 hops):**
```text
PE0.pe_dma → r0c0 → hbm_ctrl (switching overhead only)
```
**Remote HBM (mesh hops):**
```text
PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl
```
**M_CPU DMA:**
```text
M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl
```
---
### D6. All Traffic Is Unified onto the Same Router Mesh
- All memory accesses (DMA data) and commands (PE_CPU) use the same router mesh
- Local access does not use a separate fast path (xbar)
- Cross-cube (remote) access path:
```text
PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT}
→ [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl
```
UCIe connections maintain the existing structure,
but both endpoints become mesh routers instead of xbars.
The number of UCIe lines is determined by BW ratio: `ucie_lines_per_side = ceil(ucie_bw / noc_line_bw)`.
---
### D7. AddressResolver Changes
Current `AddressResolver.resolve()`:
```python
# Current: HBM offset → pe_slice → "sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
pe_slice = PhysAddr.hbm_pe_id(addr.hbm_offset, self._slice_size_bytes)
return f"sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
```
After change:
```python
# Changed: HBM → single endpoint
return f"sip{s}.cube{c}.hbm_ctrl"
```
The pe_slice calculation is removed.
In n:1 mode, PE_DMA directly accesses the hbm_ctrl attached to its own router.
resolver.resolve() is retained for external access (M_CPU DMA, etc.) and backward compatibility.
---
### D8. topology.yaml Configuration Changes
#### Added Settings
```yaml
cube:
memory_map:
hbm_mapping_mode: n_to_one # one_to_one | n_to_one
hbm_pseudo_channels: 64 # total pseudo channel count
hbm_channels_per_pe: 8 # local channels per PE (= pseudo_channels / pes_per_cube)
hbm_channel_bw_gbs: 32.0 # per-channel bandwidth (GB/s)
hbm_total_gb_per_cube: 48 # retained
```
#### Removed Settings
```yaml
# To be removed
links:
xbar_to_hbm_bw_gbs: 256.0 # → replaced by channel_bw_gbs × channels_per_pe
xbar_to_hbm_mm: 2.5 # → replaced by ch_router_to_hbm_mm
xbar_to_bridge_bw_gbs: 128.0 # → removed (no bridge)
xbar_to_bridge_mm: 3.0 # → removed
noc_to_xbar_bw_gbs: ... # → removed
noc_to_xbar_mm: ... # → removed
```
#### Added Link Settings
```yaml
links:
router_link_bw_gbs: 256.0 # XY mesh link BW between routers
router_overhead_ns: 2.0 # router switching overhead
pe_to_router_bw_gbs: 256.0 # PE_DMA ↔ router
hbm_to_router_bw_gbs: 256.0 # HBM ↔ router (= N × channel_bw)
```
---
### D9. Bandwidth Numerical Consistency
| Configuration | Value |
| ---- | --- |
| pseudo channels per cube | 64 (parameter) |
| PEs per cube | 8 (parameter) |
| channels per PE (N) | `pseudo_channels / pes_per_cube` = 8 |
| per-channel BW | 32 GB/s (parameter) |
| per-PE local BW | N × 32 = 256 GB/s |
| cube total HBM BW | 64 × 32 = 2048 GB/s |
The effective BW per PE is identical in both modes:
- 1:1 mode: N channel links × channel_bw_gbs = N × 32 = 256 GB/s
- n:1 mode: 1 aggregated link = N × channel_bw_gbs = 256 GB/s
---
## Consequences
### Positive
- The router mesh based on cube_mesh.yaml accurately reflects physical placement
- In n:1 mode, the existing VA scheme is preserved, keeping transition costs low
- Local / remote / command traffic is unified onto the same mesh, resulting in simplicity
- Aligns well with graph compiler-based topology generation
- Channel count and PE count are both parameterized, enabling testing of various configurations
- 1:1 mode extension naturally follows through router differentiation
### Negative
- The number of SimPy nodes increases due to explicit router nodes (6x6 = up to 32 routers/cube)
- Requires complete rewrite of existing xbar/bridge/single NOC-based tests
- The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model
---
## Alternatives
### A1. Retain Existing xbar + HBM Slices
- Local/remote paths remain bifurcated
- Cannot model at pseudo-channel granularity
- Cannot switch between 1:1/n:1 modes
### A2. Always Generate Per-Channel Links and Aggregate Only in n:1
- Topology structure always has 1:1 size
- Expressing n:1 semantics via link aggregation is complex
- No reduction in router node count
### A3. Gradual Transition (Retain xbar + Add NOC Path)
- Higher compatibility, but dual-path coexistence increases complexity
- Since xbar removal is ultimately necessary, the intermediate step provides little value
---
## Implementation Notes
### topology/builder.py Change Details
#### Code to Remove (within current `_instantiate_cube()`)
- xbar_top, xbar_bot node creation (~line 495-508)
- bridge.left, bridge.right node creation
- noc ↔ xbar edge creation (~line 540-555)
- xbar ↔ hbm_ctrl.slice edge creation (~line 510-538)
- xbar ↔ bridge edge creation (~line 557-572)
#### Code to Add
1:1 mode:
```python
N = hbm_channels_per_pe # from topology config
total_ch = hbm_pseudo_channels
# Create channel router nodes
for ch_id in range(total_ch):
pe_id = ch_id // N
nodes[f"{cp}.ch_r{ch_id}"] = Node(
id=f"{cp}.ch_r{ch_id}", kind="noc_router", impl="noc_v1",
attrs={}, pos_mm=(...), # horizontal row = ch_id % N
)
# PE_DMA ↔ local channel router edges
for pe_id in range(pes_per_cube):
for local_ch in range(N):
ch_id = pe_id * N + local_ch
edges.append(Edge(
src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.ch_r{ch_id}",
bw_gbs=channel_bw, kind="pe_to_ch_router", ...))
edges.append(Edge(
src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.pe{pe_id}.pe_dma",
bw_gbs=channel_bw, kind="ch_router_to_pe", ...))
# Channel router ↔ hbm_ctrl edges
for ch_id in range(total_ch):
edges.append(Edge(
src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.hbm_ctrl",
bw_gbs=channel_bw, kind="ch_router_to_hbm", ...))
edges.append(Edge(
src=f"{cp}.hbm_ctrl", dst=f"{cp}.ch_r{ch_id}",
bw_gbs=channel_bw, kind="hbm_to_ch_router", ...))
# Horizontal line edges (same logical index)
for row in range(N):
for p in range(pes_per_cube - 1):
ch_a = p * N + row
ch_b = (p + 1) * N + row
edges.append(Edge(
src=f"{cp}.ch_r{ch_a}", dst=f"{cp}.ch_r{ch_b}",
bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...))
edges.append(Edge(
src=f"{cp}.ch_r{ch_b}", dst=f"{cp}.ch_r{ch_a}",
bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...))
```
n:1 mode:
```python
# Create aggregated router nodes
for pe_id in range(pes_per_cube):
nodes[f"{cp}.pe{pe_id}.agg_router"] = Node(
id=f"{cp}.pe{pe_id}.agg_router", kind="noc_router", impl="noc_v1",
attrs={}, pos_mm=(...),
)
agg_bw = N * channel_bw # aggregated BW
# PE_DMA ↔ aggregated router
for pe_id in range(pes_per_cube):
edges.append(Edge(
src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.pe{pe_id}.agg_router",
bw_gbs=agg_bw, kind="pe_to_agg_router", ...))
edges.append(Edge(
src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.pe{pe_id}.pe_dma",
bw_gbs=agg_bw, kind="agg_router_to_pe", ...))
# Aggregated router ↔ hbm_ctrl
for pe_id in range(pes_per_cube):
edges.append(Edge(
src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.hbm_ctrl",
bw_gbs=agg_bw, kind="agg_to_hbm", ...))
edges.append(Edge(
src=f"{cp}.hbm_ctrl", dst=f"{cp}.pe{pe_id}.agg_router",
bw_gbs=agg_bw, kind="hbm_to_agg", ...))
# Horizontal links between aggregated routers
for p in range(pes_per_cube - 1):
edges.append(Edge(
src=f"{cp}.pe{p}.agg_router", dst=f"{cp}.pe{p+1}.agg_router",
bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...))
edges.append(Edge(
src=f"{cp}.pe{p+1}.agg_router", dst=f"{cp}.pe{p}.agg_router",
bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...))
```
### Affected Existing Tests
| Test File | Impact |
| ---------- | ---- |
| `tests/test_topology_compile.py` | Remove xbar/bridge node references, add channel router verification |
| `tests/test_topology_load.py` | Reflect topology.yaml configuration changes |
| `tests/test_pe_components.py` | PE_DMA routing path changes |
| `tests/test_sip_parallel.py` | Cross-PE access path changes |
| Cases that directly test xbar/bridge | Remove |
---
## Test Requirements
- Verify that requests are delivered via per-channel links in 1:1 mode
- Verify that requests are delivered via the aggregated link in n:1 mode
- Verify that topology is correctly generated in both modes:
- 1:1: `total_ch` channel routers + per-PE links + horizontal links
- n:1: `pes_per_cube` aggregated routers + per-PE links
- Verify that effective BW is consistent across both modes for the same workload
- Verify that horizontal line routing works for cross-PE access
- Verify that routing through UCIe works for cross-cube access
- Verify that topology generation is correct under parameter variations (channels_per_pe = 4, 8, 16, etc.)
---
## Links
- ADR-0018 (LA + BAAW) → addressing-side integration
- ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
- ADR-0004 (Memory Semantics) → BW model redefinition
- ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes