Files
kernbench2/docs/adr/ADR-0019-NOC-Local HBM.en.md
T
ywkang 22fd0d2b9d ADR: introduce docs/history/, merge 0011+0018, prune migration cruft
- CLAUDE.md: add ADR Lifecycle subsection (superseded → docs/history/,
  immutable numbering, no renumber)
- ADR-0011: merge ADR-0018 content as "Address Model: LA" section
  alongside PA / VA; status notes VA model is currently implemented
- ADR-0018 / 0029 / 0031: moved to docs/history/ with status updates
  (0018 merged into 0011, 0029 superseded by 0032, 0031 absorbed
  into 0001 rev 2)
- ADR-0019: rewrite Context as PE-HBM connectivity decision
  (self-contained, no LA model framing)
- ADR-0019/0020/0021/0023/0025/0027: Status Proposed → Accepted
  (code verified) and prune Implementation Notes / Affected files /
  Test strategy / "현재 상태" sub-sections describing pre-impl state
- ADR-0024/0026: same migration-flavor cleanup; 0026 also drops D6
  Migration and D8 docs-update sub-decisions
- ADR-0030: status simplified (blocker ADR-0031 now superseded)
- SPEC.md: R10 + §0.2 reflect PA / VA / LA model names
- ADR-0008/0012/0013: refresh ADR-0011 subtitle in Links

21 files changed, 553 insertions(+), 1290 deletions(-).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 11:42:45 -07:00

306 lines
9.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC
## Status
Accepted
## Context
The CUBE-internal NOC must connect each PE to HBM. KernBench needs
to evaluate two connectivity models:
- **1:1 mode** — PE_DMA connects to N separate per-channel routers,
each with its own link to hbm_ctrl. Models per-channel BW
contention precisely.
N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`).
- **n:1 mode** — PE_DMA connects to a single aggregated router with
one link to hbm_ctrl. Channels are treated as interleaved; only
aggregate BW is modeled.
Effective PE-local BW is identical under both modes
(= N × per-channel BW); only the connectivity granularity differs.
---
## Decision
### D1. HBM Attaches to PE Routers
Consolidate the current `hbm_ctrl.slice{0-7}` (8 nodes) into a **single `hbm_ctrl` node**,
and attach the HBM access point to the same router where the PE is attached.
- n:1 mode: PE's local HBM access goes directly from its own router (switching overhead only, 0 hops)
- Remote PE's HBM access: reaches the target PE's router via mesh hops
- The read/write resource model within the HBM controller is preserved
Node naming changes:
| Current | After Change |
| ---- | ------- |
| `sip0.cube0.hbm_ctrl.slice0` ~ `slice7` | `sip0.cube0.hbm_ctrl` (single) |
In `mesh_gen.py`, add `pe{idx}.hbm` to the PE attachment so that
the builder generates an edge between that router and hbm_ctrl.
---
### D2. Complete Removal of xbar, bridge, and Single NOC Node
Remove all of the following nodes and related edges:
- `{cube}.xbar_top`, `{cube}.xbar_bot`
- `{cube}.bridge.left`, `{cube}.bridge.right`
- `{cube}.noc` (single TwoDMeshNocComponent node)
- Edges of type `noc_to_xbar`, `xbar_to_noc`, `xbar_to_hbm`, `hbm_to_xbar`
- Edges of type `xbar_to_bridge`, `bridge_to_xbar`
- Edges of type `pe_to_noc`, `noc_to_pe`, `noc_to_pe_cpu`, etc. referencing the single noc node
Their role is replaced by an **explicit router mesh based on cube_mesh.yaml**.
Each router (r0c0, r0c1, ...) from the 6x6 router grid generated by `mesh_gen.py`
is created as a separate SimPy node in the topology graph,
and adjacent routers are connected via XY mesh edges.
---
### D3. Explicit Router Mesh (Common Basis for n:1 / 1:1)
#### Router Nodes Based on cube_mesh.yaml
Each non-null router from cube_mesh.yaml generated by `mesh_gen.py`
is created as a **separate SimPy node** in the topology graph.
- Node ID: `{cube}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`)
- kind: `noc_router`, impl: `forwarding_v1`
- pos_mm: taken from cube_mesh.yaml
Based on the attach information in cube_mesh.yaml, components are connected to each router:
- `pe{p}.dma` → PE_DMA ↔ router edge
- `pe{p}.cpu` → PE_CPU ↔ router edge
- `pe{p}.hbm` → HBM_CTRL ↔ router edge (added in n:1)
- `m_cpu` → M_CPU ↔ router edge
- `sram` → SRAM ↔ router edge
- `ucie_{dir}.c{i}` → UCIe conn ↔ router edge
Router-to-router XY mesh edges: bidirectional edges between adjacent routers.
Null routers (HBM exclusion zones) are skipped.
#### 1:1 Mode Extension (To Be Implemented Later)
In 1:1 mode, each router differentiates into N channel mini-routers.
Per-channel routing and ChannelSplitter (LA → per-channel PA) introduction are required.
N GEMM engines per PE are also added at this point.
---
### D4. Cross-PE HBM Access (n:1 Mode)
In n:1 mode, when a PE accesses another PE's local HBM,
it hops through the XY mesh in cube_mesh.yaml to reach the target PE's router.
Example: PE0 (r0c0) accessing PE2's (r1c4) HBM:
```text
PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl
```
The Dijkstra router finds the shortest path in the mesh.
Cross-PE channel access in 1:1 mode will be defined during the 1:1 extension in D3.
---
### D5. n:1 Mode: Uses cube_mesh.yaml Router Mesh
In n:1 mode, no separate "aggregated router" is created.
The existing router grid from cube_mesh.yaml serves that role.
#### Connection Structure
PE_DMA, PE_CPU, and HBM are all connected to the router where each PE is attached:
```text
sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0 (bw: N × channel_bw_gbs)
sip0.cube0.hbm_ctrl ←→ sip0.cube0.r0c0 (bw: N × channel_bw_gbs)
```
Routers are connected via XY mesh edges. PE's local HBM access goes
directly from its own router (switching overhead only).
#### n:1 Mode Full Data Paths
**Local HBM (0 hops):**
```text
PE0.pe_dma → r0c0 → hbm_ctrl (switching overhead only)
```
**Remote HBM (mesh hops):**
```text
PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl
```
**M_CPU DMA:**
```text
M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl
```
---
### D6. All Traffic Is Unified onto the Same Router Mesh
- All memory accesses (DMA data) and commands (PE_CPU) use the same router mesh
- Local access does not use a separate fast path (xbar)
- Cross-cube (remote) access path:
```text
PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT}
→ [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl
```
UCIe connections maintain the existing structure,
but both endpoints become mesh routers instead of xbars.
The number of UCIe lines is determined by BW ratio: `ucie_lines_per_side = ceil(ucie_bw / noc_line_bw)`.
---
### D7. AddressResolver Changes
Current `AddressResolver.resolve()`:
```python
# Current: HBM offset → pe_slice → "sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
pe_slice = PhysAddr.hbm_pe_id(addr.hbm_offset, self._slice_size_bytes)
return f"sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
```
After change:
```python
# Changed: HBM → single endpoint
return f"sip{s}.cube{c}.hbm_ctrl"
```
The pe_slice calculation is removed.
In n:1 mode, PE_DMA directly accesses the hbm_ctrl attached to its own router.
resolver.resolve() is retained for external access (M_CPU DMA, etc.) and backward compatibility.
---
### D8. topology.yaml Configuration Changes
#### Added Settings
```yaml
cube:
memory_map:
hbm_mapping_mode: n_to_one # one_to_one | n_to_one
hbm_pseudo_channels: 64 # total pseudo channel count
hbm_channels_per_pe: 8 # local channels per PE (= pseudo_channels / pes_per_cube)
hbm_channel_bw_gbs: 32.0 # per-channel bandwidth (GB/s)
hbm_total_gb_per_cube: 48 # retained
```
#### Removed Settings
```yaml
# To be removed
links:
xbar_to_hbm_bw_gbs: 256.0 # → replaced by channel_bw_gbs × channels_per_pe
xbar_to_hbm_mm: 2.5 # → replaced by ch_router_to_hbm_mm
xbar_to_bridge_bw_gbs: 128.0 # → removed (no bridge)
xbar_to_bridge_mm: 3.0 # → removed
noc_to_xbar_bw_gbs: ... # → removed
noc_to_xbar_mm: ... # → removed
```
#### Added Link Settings
```yaml
links:
router_link_bw_gbs: 256.0 # XY mesh link BW between routers
router_overhead_ns: 2.0 # router switching overhead
pe_to_router_bw_gbs: 256.0 # PE_DMA ↔ router
hbm_to_router_bw_gbs: 256.0 # HBM ↔ router (= N × channel_bw)
```
---
### D9. Bandwidth Numerical Consistency
| Configuration | Value |
| ---- | --- |
| pseudo channels per cube | 64 (parameter) |
| PEs per cube | 8 (parameter) |
| channels per PE (N) | `pseudo_channels / pes_per_cube` = 8 |
| per-channel BW | 32 GB/s (parameter) |
| per-PE local BW | N × 32 = 256 GB/s |
| cube total HBM BW | 64 × 32 = 2048 GB/s |
The effective BW per PE is identical in both modes:
- 1:1 mode: N channel links × channel_bw_gbs = N × 32 = 256 GB/s
- n:1 mode: 1 aggregated link = N × channel_bw_gbs = 256 GB/s
---
## Consequences
### Positive
- The router mesh based on cube_mesh.yaml accurately reflects physical placement
- In n:1 mode, the existing VA scheme is preserved, keeping transition costs low
- Local / remote / command traffic is unified onto the same mesh, resulting in simplicity
- Aligns well with graph compiler-based topology generation
- Channel count and PE count are both parameterized, enabling testing of various configurations
- 1:1 mode extension naturally follows through router differentiation
### Negative
- The number of SimPy nodes increases due to explicit router nodes (6x6 = up to 32 routers/cube)
- The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model
---
## Alternatives
### A1. Retain Existing xbar + HBM Slices
- Local/remote paths remain bifurcated
- Cannot model at pseudo-channel granularity
- Cannot switch between 1:1/n:1 modes
### A2. Always Generate Per-Channel Links and Aggregate Only in n:1
- Topology structure always has 1:1 size
- Expressing n:1 semantics via link aggregation is complex
- No reduction in router node count
### A3. Gradual Transition (Retain xbar + Add NOC Path)
- Higher compatibility, but dual-path coexistence increases complexity
- Since xbar removal is ultimately necessary, the intermediate step provides little value
---
## Test Requirements
- Verify that requests are delivered via per-channel links in 1:1 mode
- Verify that requests are delivered via the aggregated link in n:1 mode
- Verify that topology is correctly generated in both modes:
- 1:1: `total_ch` channel routers + per-PE links + horizontal links
- n:1: `pes_per_cube` aggregated routers + per-PE links
- Verify that effective BW is consistent across both modes for the same workload
- Verify that horizontal line routing works for cross-PE access
- Verify that routing through UCIe works for cross-cube access
- Verify that topology generation is correct under parameter variations (channels_per_pe = 4, 8, 16, etc.)
---
## Links
- ADR-0011 (LA model) → addressing-side integration
- ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
- ADR-0004 (Memory Semantics) → BW model redefinition
- ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes