# ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC

## Status

Accepted

## Context

The CUBE-internal NOC must connect each PE to HBM. KernBench needs
to evaluate two connectivity models:

- **1:1 mode**: PE_DMA connects to N separate per-channel routers,
  each with its own link to hbm_ctrl. Models per-channel BW
  contention precisely.
  N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`).
- **n:1 mode**: PE_DMA connects to a single aggregated router with
  one link to hbm_ctrl. Channels are treated as interleaved; only
  aggregate BW is modeled.

Effective PE-local BW is identical under both modes
(= N × per-channel BW); only the connectivity granularity differs.

---

## Decision

### D1. HBM Attaches to PE Routers

Consolidate the current `hbm_ctrl.slice{0-7}` (8 nodes) into a **single `hbm_ctrl` node**,
and attach the HBM access point to the same router where the PE is attached.

- n:1 mode: PE's local HBM access goes directly from its own router (switching overhead only, 0 hops)
- Remote PE's HBM access: reaches the target PE's router via mesh hops
- The read/write resource model within the HBM controller is preserved

Node naming changes:

| Current | After Change |
| ---- | ------- |
| `sip0.cube0.hbm_ctrl.slice0` ~ `slice7` | `sip0.cube0.hbm_ctrl` (single) |

In `mesh_gen.py`, add `pe{idx}.hbm` to the PE attachment so that
the builder generates an edge between that router and hbm_ctrl.

---

### D2. Complete Removal of xbar, bridge, and Single NOC Node

Remove all of the following nodes and related edges:

- `{cube}.xbar_top`, `{cube}.xbar_bot`
- `{cube}.bridge.left`, `{cube}.bridge.right`
- `{cube}.noc` (single TwoDMeshNocComponent node)
- Edges of type `noc_to_xbar`, `xbar_to_noc`, `xbar_to_hbm`, `hbm_to_xbar`
- Edges of type `xbar_to_bridge`, `bridge_to_xbar`
- Edges of type `pe_to_noc`, `noc_to_pe`, `noc_to_pe_cpu`, etc. referencing the single noc node

Their role is replaced by an **explicit router mesh based on cube_mesh.yaml**.
Each router (r0c0, r0c1, ...) from the 6x6 router grid generated by `mesh_gen.py`
is created as a separate SimPy node in the topology graph,
and adjacent routers are connected via XY mesh edges.
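
As a rough illustration of the removal list in D2, the sketch below scans a topology for leftover legacy nodes and edges. The function name `find_legacy`, the regular expression, and the `(src, dst, edge_type)` tuple layout are assumptions made for this example, not the builder's actual data model. The edge-type set covers only the types named explicitly above, not those implied by "etc.".

```python
import re

# Assumption: legacy nodes are recognizable by the suffixes listed in D2.
LEGACY_NODE_RE = re.compile(r"\.(xbar_top|xbar_bot|bridge\.(left|right)|noc)$")

# Only the edge types named explicitly in D2.
LEGACY_EDGE_TYPES = {
    "noc_to_xbar", "xbar_to_noc", "xbar_to_hbm", "hbm_to_xbar",
    "xbar_to_bridge", "bridge_to_xbar",
    "pe_to_noc", "noc_to_pe", "noc_to_pe_cpu",
}


def find_legacy(nodes, edges):
    """Return (legacy_node_ids, legacy_edges) still present in a topology.

    nodes: iterable of node-id strings.
    edges: iterable of (src, dst, edge_type) tuples (hypothetical layout).
    """
    bad_nodes = [n for n in nodes if LEGACY_NODE_RE.search(n)]
    bad_edges = [e for e in edges if e[2] in LEGACY_EDGE_TYPES]
    return bad_nodes, bad_edges
```

A topology that follows D2 would yield two empty lists from this check.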

---

### D3. Explicit Router Mesh (Common Basis for n:1 / 1:1)

#### Router Nodes Based on cube_mesh.yaml

Each non-null router from cube_mesh.yaml generated by `mesh_gen.py`
is created as a **separate SimPy node** in the topology graph.

- Node ID: `{cube}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`)
- kind: `noc_router`, impl: `forwarding_v1`
- pos_mm: taken from cube_mesh.yaml

Based on the attach information in cube_mesh.yaml, components are connected to each router:

- `pe{p}.dma` → PE_DMA ↔ router edge
- `pe{p}.cpu` → PE_CPU ↔ router edge
- `pe{p}.hbm` → HBM_CTRL ↔ router edge (added in n:1)
- `m_cpu` → M_CPU ↔ router edge
- `sram` → SRAM ↔ router edge
- `ucie_{dir}.c{i}` → UCIe conn ↔ router edge

Router-to-router XY mesh edges: bidirectional edges between adjacent routers.
Null routers (HBM exclusion zones) are skipped.
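
The node and edge rules above can be sketched as follows. This is a minimal illustration, not the builder's code: the function `build_router_mesh` and the boolean `grid` representation are hypothetical, and it assumes that no edge bridges across a null router.

```python
def build_router_mesh(cube, grid):
    """Build router node IDs and bidirectional XY mesh edges.

    cube: node-id prefix such as "sip0.cube0".
    grid: list of rows; grid[row][col] is truthy for a real router and
          falsy for a null router (hypothetical representation).
    """
    def node_id(r, c):
        return f"{cube}.r{r}c{c}"

    nodes = [
        node_id(r, c)
        for r, row in enumerate(grid)
        for c, present in enumerate(row)
        if present
    ]

    edges = []
    for r, row in enumerate(grid):
        for c, present in enumerate(row):
            if not present:
                continue  # null routers are skipped
            # Look at the right and lower neighbours; emit both directions.
            for nr, nc in ((r, c + 1), (r + 1, c)):
                if nr < len(grid) and nc < len(grid[nr]) and grid[nr][nc]:
                    edges.append((node_id(r, c), node_id(nr, nc)))
                    edges.append((node_id(nr, nc), node_id(r, c)))
    return nodes, edges
```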

#### 1:1 Mode Extension (To Be Implemented Later)

In 1:1 mode, each router differentiates into N channel mini-routers.
This requires per-channel routing and the introduction of a ChannelSplitter (LA → per-channel PA).
N GEMM engines per PE are also added at this point.

---

### D4. Cross-PE HBM Access (n:1 Mode)

In n:1 mode, when a PE accesses another PE's local HBM,
the request hops through the XY mesh in cube_mesh.yaml to reach the target PE's router.

Example: PE0 (r0c0) accessing PE2's (r1c4) HBM:

```text
PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl
```

The Dijkstra router finds the shortest path in the mesh.

Cross-PE channel access in 1:1 mode will be defined during the 1:1 extension in D3.
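
To make the D4 hop count concrete, here is a small sketch of a shortest-path search over unit-cost mesh edges. The helper names (`grid_edges`, `shortest_path`), the 2x5 slice of the mesh, and the short node names are assumptions for illustration; only the target PE's HBM attachment edge is included so that the search has a single destination.

```python
import heapq


def grid_edges(rows, cols):
    """Bidirectional edges between adjacent routers of a full rows x cols grid."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            for nr, nc in ((r, c + 1), (r + 1, c)):
                if nr < rows and nc < cols:
                    a, b = f"r{r}c{c}", f"r{nr}c{nc}"
                    edges += [(a, b), (b, a)]
    return edges


def shortest_path(edges, src, dst):
    """Dijkstra over unit-cost directed edges; returns a node list or None."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt in adj.get(node, []):
            nd = d + 1
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


# Hypothetical slice: PE0 on r0c0, the target PE's HBM edge on r1c4.
EDGES = grid_edges(2, 5) + [("pe0.pe_dma", "r0c0"), ("r1c4", "hbm_ctrl")]
PATH = shortest_path(EDGES, "pe0.pe_dma", "hbm_ctrl")
```

Several equal-length routes exist between r0c0 and r1c4, so the exact router sequence may differ from the example path; only the number of router-to-router hops (5) is fixed.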

---

### D5. n:1 Mode: Uses cube_mesh.yaml Router Mesh

In n:1 mode, no separate "aggregated router" is created.
The existing router grid from cube_mesh.yaml serves that role.

#### Connection Structure

PE_DMA, PE_CPU, and HBM are all connected to the router where each PE is attached:

```text
sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0 (bw: N × channel_bw_gbs)
sip0.cube0.hbm_ctrl ←→ sip0.cube0.r0c0 (bw: N × channel_bw_gbs)
```

Routers are connected via XY mesh edges. PE's local HBM access goes
directly from its own router (switching overhead only).

#### n:1 Mode Full Data Paths

**Local HBM (0 hops):**
```text
PE0.pe_dma → r0c0 → hbm_ctrl (switching overhead only)
```

**Remote HBM (mesh hops):**
```text
PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl
```

**M_CPU DMA:**
```text
M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl
```

---

### D6. All Traffic Is Unified onto the Same Router Mesh

- All memory accesses (DMA data) and commands (PE_CPU) use the same router mesh
- Local access does not use a separate fast path (xbar)
- Cross-cube (remote) access path:

```text
PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT}
  → [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl
```

UCIe connections maintain the existing structure,
but both endpoints become mesh routers instead of xbars.

The number of UCIe lines is determined by BW ratio: `ucie_lines_per_side = ceil(ucie_bw / noc_line_bw)`.
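
The line-count formula can be written directly in Python. The function signature and the input validation are assumptions for this sketch, and the bandwidth values in the usage note are hypothetical rather than taken from topology.yaml.

```python
import math


def ucie_lines_per_side(ucie_bw_gbs: float, noc_line_bw_gbs: float) -> int:
    """Number of NOC lines needed to carry the UCIe bandwidth on one side."""
    if ucie_bw_gbs <= 0 or noc_line_bw_gbs <= 0:
        raise ValueError("bandwidths must be positive")
    # Round up so the aggregate NOC line BW is never below the UCIe BW.
    return math.ceil(ucie_bw_gbs / noc_line_bw_gbs)
```

For example, a hypothetical 600 GB/s UCIe side over 256 GB/s NOC lines needs 3 lines, while exactly 512 GB/s needs 2.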

---

### D7. AddressResolver Changes

Current `AddressResolver.resolve()`:

```python
# Current: HBM offset → pe_slice → "sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
pe_slice = PhysAddr.hbm_pe_id(addr.hbm_offset, self._slice_size_bytes)
return f"sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
```

After change:

```python
# Changed: HBM → single endpoint
return f"sip{s}.cube{c}.hbm_ctrl"
```

The pe_slice calculation is removed.
In n:1 mode, PE_DMA directly accesses the hbm_ctrl attached to its own router.

resolver.resolve() is retained for external access (M_CPU DMA, etc.) and backward compatibility.
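
A self-contained sketch of the post-change behavior is shown below. `HbmAddr` and `resolve_hbm_node` are hypothetical stand-ins, not the project's `PhysAddr` or `AddressResolver` API; the point is only that the HBM offset no longer influences the returned node ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HbmAddr:
    """Hypothetical stand-in for the project's address type."""
    sip: int
    cube: int
    hbm_offset: int


def resolve_hbm_node(addr: HbmAddr) -> str:
    """Map an HBM address to the single per-cube controller node."""
    # The offset no longer selects a slice node; every offset in a cube
    # resolves to the same hbm_ctrl endpoint.
    return f"sip{addr.sip}.cube{addr.cube}.hbm_ctrl"
```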

---

### D8. topology.yaml Configuration Changes

#### Added Settings

```yaml
cube:
  memory_map:
    hbm_mapping_mode: n_to_one  # one_to_one | n_to_one
    hbm_pseudo_channels: 64  # total pseudo channel count
    hbm_channels_per_pe: 8  # local channels per PE (= pseudo_channels / pes_per_cube)
    hbm_channel_bw_gbs: 32.0  # per-channel bandwidth (GB/s)
    hbm_total_gb_per_cube: 48  # retained
```

#### Removed Settings

```yaml
# To be removed
links:
  xbar_to_hbm_bw_gbs: 256.0  # → replaced by channel_bw_gbs × channels_per_pe
  xbar_to_hbm_mm: 2.5  # → replaced by ch_router_to_hbm_mm
  xbar_to_bridge_bw_gbs: 128.0  # → removed (no bridge)
  xbar_to_bridge_mm: 3.0  # → removed
  noc_to_xbar_bw_gbs: ...  # → removed
  noc_to_xbar_mm: ...  # → removed
```

#### Added Link Settings

```yaml
links:
  router_link_bw_gbs: 256.0  # XY mesh link BW between routers
  router_overhead_ns: 2.0  # router switching overhead
  pe_to_router_bw_gbs: 256.0  # PE_DMA ↔ router
  hbm_to_router_bw_gbs: 256.0  # HBM ↔ router (= N × channel_bw)
```

---

### D9. Bandwidth Numerical Consistency

| Configuration | Value |
| ---- | --- |
| pseudo channels per cube | 64 (parameter) |
| PEs per cube | 8 (parameter) |
| channels per PE (N) | `pseudo_channels / pes_per_cube` = 8 |
| per-channel BW | 32 GB/s (parameter) |
| per-PE local BW | N × 32 = 256 GB/s |
| cube total HBM BW | 64 × 32 = 2048 GB/s |

The effective BW per PE is identical in both modes:

- 1:1 mode: N channel links × channel_bw_gbs = N × 32 = 256 GB/s
- n:1 mode: 1 aggregated link = N × channel_bw_gbs = 256 GB/s
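
The arithmetic in the D9 table can be checked with a short sketch. The function `per_pe_bw_gbs` and its parameter names are hypothetical; the mode strings follow the `hbm_mapping_mode` values in D8, and the defaults are the table's parameter values.

```python
def per_pe_bw_gbs(mode, pseudo_channels=64, pes_per_cube=8, channel_bw_gbs=32.0):
    """Effective PE-local HBM bandwidth (GB/s) for a given mapping mode."""
    if pseudo_channels % pes_per_cube:
        raise ValueError("pseudo_channels must be a multiple of pes_per_cube")
    n = pseudo_channels // pes_per_cube  # channels per PE (N)
    if mode == "one_to_one":
        links = [channel_bw_gbs] * n  # N per-channel links
    elif mode == "n_to_one":
        links = [n * channel_bw_gbs]  # one aggregated link
    else:
        raise ValueError(f"unknown hbm_mapping_mode: {mode}")
    return sum(links)
```

With the default parameters both modes return 256.0, and the equality holds for any channel count that divides evenly across the PEs.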

---

## Consequences

### Positive

- The router mesh based on cube_mesh.yaml accurately reflects physical placement
- In n:1 mode, the existing VA scheme is preserved, keeping transition costs low
- Local, remote, and command traffic is unified onto the same mesh, which simplifies the model
- Aligns well with graph compiler-based topology generation
- Channel count and PE count are both parameterized, enabling testing of various configurations
- The 1:1 mode extension follows naturally from router differentiation

### Negative

- The number of SimPy nodes increases due to explicit router nodes (6x6 grid, up to 32 routers per cube)
- The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model

---

## Alternatives

### A1. Retain Existing xbar + HBM Slices

- Local/remote paths remain bifurcated
- Cannot model at pseudo-channel granularity
- Cannot switch between 1:1/n:1 modes

### A2. Always Generate Per-Channel Links and Aggregate Only in n:1

- Topology structure always has 1:1 size
- Expressing n:1 semantics via link aggregation is complex
- No reduction in router node count

### A3. Gradual Transition (Retain xbar + Add NOC Path)

- Higher compatibility, but dual-path coexistence increases complexity
- Since xbar removal is ultimately necessary, the intermediate step provides little value

---

## Test Requirements

- Verify that requests are delivered via per-channel links in 1:1 mode
- Verify that requests are delivered via the aggregated link in n:1 mode
- Verify that topology is correctly generated in both modes:
  - 1:1: `total_ch` channel routers + per-PE links + horizontal links
  - n:1: `pes_per_cube` aggregated routers + per-PE links
- Verify that effective BW is consistent across both modes for the same workload
- Verify that horizontal line routing works for cross-PE access
- Verify that routing through UCIe works for cross-cube access
- Verify that topology generation is correct under parameter variations (channels_per_pe = 4, 8, 16, etc.)

---

## Links

- ADR-0011 (LA model) → addressing-side integration
- ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
- ADR-0004 (Memory Semantics) → BW model redefinition
- ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes