5917b3497c
- Remove xbar_top/bot, bridge, single noc node from topology
- Each cube_mesh.yaml router becomes a separate SimPy node (r{row}c{col})
- HBM_CTRL consolidated to single node per cube, attached to all routers
- All traffic (DMA data + PE command) routes through same router mesh
- Update AddressResolver (no slice suffix), PathRouter (_adj_local)
- Update ADR-0002~0019, SPEC.md to remove xbar/bridge references
- Regenerate SVG diagrams for new topology structure
- Skip cross-SIP PE_TCM and PE_MMU routing tests (not yet wired)
326 passed, 13 skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# ADR-0017: Cube NOC 2D Mesh Architecture

## Status

Accepted

## Context

ADR-0003 D3 defines the cube-level NOC as a "distributed on-die fabric" but
does not specify the internal routing model, contention semantics, or
attachment topology. The implementation uses a 2D mesh router grid with
XY routing and per-segment contention modeling. This ADR formalizes that
architecture.

## Decision

### D1. NOC node and router grid

Each cube contains a 2D router mesh generated by `mesh_gen.py`.
Each router is a separate topology node (`sip{S}.cube{C}.r{row}c{col}`)
implemented as `forwarding_v1`. (Supersedes the original single-node
`noc_2d_mesh_v1` design; see ADR-0019.)

Grid properties:

- Default dimensions: 6x6 routers (derived from PE layout + UCIe connections)
- Router naming: `r{row}c{col}` (e.g., `r0c0`, `r5c5`)
- HBM exclusion zone: center rows/columns are excluded where HBM physically
  occupies space (e.g., r2c2, r2c3, r3c2, r3c3)
- Router positions are derived from physical PE corner placement and cube
  geometry

The NOC `overhead_ns` is 0.0. Latency is modeled by Manhattan distance
traversal within the mesh (`distance_mm` x `ns_per_mm`).
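
The naming and latency rules above can be sketched in a few lines of Python. This is a minimal illustration, not the `mesh_gen.py` implementation: the router pitch (`PITCH_MM`) and the `NS_PER_MM` value are hypothetical placeholders, and the helper names are invented for this example. The HBM zone uses the example cells listed above.

```python
from itertools import product

ROWS, COLS = 6, 6                            # default grid from D1
HBM_ZONE = {(2, 2), (2, 3), (3, 2), (3, 3)}  # (row, col) cells from the D1 example
PITCH_MM = 2.0      # hypothetical router-to-router spacing
NS_PER_MM = 0.125   # hypothetical wire delay per millimetre


def router_names(sip: int, cube: int) -> list[str]:
    """Topology node names for every router outside the HBM exclusion zone."""
    return [
        f"sip{sip}.cube{cube}.r{row}c{col}"
        for row, col in product(range(ROWS), range(COLS))
        if (row, col) not in HBM_ZONE
    ]


def manhattan_latency_ns(src: tuple[int, int], dst: tuple[int, int]) -> float:
    """Uncontended latency between two (row, col) routers: distance_mm x ns_per_mm."""
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    distance_mm = hops * PITCH_MM
    return distance_mm * NS_PER_MM


names = router_names(sip=0, cube=0)
print(len(names))                            # 36 grid cells minus 4 HBM cells
print(manhattan_latency_ns((0, 0), (5, 5)))  # 10 hops at the assumed pitch
```

With the example zone, a 6x6 grid yields 32 router nodes. The real generator derives positions from physical placement, so a uniform pitch is a simplification.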

### D2. XY routing algorithm

The NOC uses deterministic XY routing:

1. Horizontal segment: at source Y, route from source X to destination X
2. Vertical segment: at destination X, route from source Y to destination Y

Each directed segment is identified by a unique link key:

- Horizontal: `("H", y_band, x_min, x_max, direction)`
- Vertical: `("V", x_band, y_min, y_max, direction)`

Grid positions are snapped to the router grid, excluding the HBM zone.
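
As a rough sketch of how the two link keys could be derived for one transaction, consider the function below. The direction labels (`"E"`, `"W"`, `"N"`, `"S"`), the convention that Y grows southward, and the function name `xy_segments` are assumptions made for this example; snapping around the HBM zone is omitted.

```python
LinkKey = tuple[str, int, int, int, str]


def xy_segments(src: tuple[int, int], dst: tuple[int, int]) -> list[LinkKey]:
    """Return the link keys for an XY route between two (x, y) grid positions.

    The horizontal leg runs first along the source row, then the vertical
    leg runs along the destination column. A leg of length zero is skipped.
    """
    (sx, sy), (dx, dy) = src, dst
    segments: list[LinkKey] = []
    if sx != dx:
        direction = "E" if dx > sx else "W"   # assumed label convention
        segments.append(("H", sy, min(sx, dx), max(sx, dx), direction))
    if sy != dy:
        direction = "S" if dy > sy else "N"   # assumes Y grows southward
        segments.append(("V", dx, min(sy, dy), max(sy, dy), direction))
    return segments


print(xy_segments((0, 0), (5, 4)))
# [('H', 0, 0, 5, 'E'), ('V', 5, 0, 4, 'S')]
```

Because the direction is part of the key, opposing flows on the same row or column map to different keys and do not share a resource.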

### D3. Contention model

Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
sharing a segment (same row or column band, same direction) contend for the
resource. This models link-level serialization in a wormhole-routed mesh.

With no contention, NOC traversal latency equals the Manhattan distance
multiplied by `ns_per_mm`. Under contention, additional queueing delay
is added by SimPy's resource scheduling.
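
The queueing effect can be illustrated without SimPy. The sketch below is a standard-library stand-in for a capacity-1 resource with a first-come, first-served queue (the default behavior of `simpy.Resource`). It assumes each transaction holds the segment for a fixed time; the function name and the sample numbers are hypothetical.

```python
def serialize_on_segment(requests: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Serialize transactions on one directed segment.

    `requests` holds (arrival_ns, hold_ns) pairs. Returns (start_ns, finish_ns)
    pairs in arrival order; a transaction starts when it has arrived and the
    segment is free, whichever is later.
    """
    free_at = 0.0
    schedule = []
    for arrival_ns, hold_ns in sorted(requests, key=lambda req: req[0]):
        start_ns = max(arrival_ns, free_at)   # wait while the segment is busy
        finish_ns = start_ns + hold_ns
        schedule.append((start_ns, finish_ns))
        free_at = finish_ns
    return schedule


# Three transactions on the same segment; the second arrives while it is busy.
schedule = serialize_on_segment([(0.0, 4.0), (1.0, 4.0), (10.0, 4.0)])
print(schedule)  # [(0.0, 4.0), (4.0, 8.0), (10.0, 14.0)]
```

Here the second transaction waits 3 ns, while the third arrives after the segment has drained and sees only the uncontended latency.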

### D4. NOC attachment points

The NOC connects to all major cube-level components:

```text
                 UCIe-N (conn x4)
                         |
           +---------+---+---+---------+
           |         |       |         |
PE0.dma ---+  r0c0   |  ...  |  r0c5   +--- PE2.dma
PE0.cpu <--+         |       |         +--< PE2.cpu
           |         |       |         |
UCIe-W ----+   ...   | [HBM] |   ...   +---- UCIe-E
(conn x4)  |         | zone  |         |  (conn x4)
           |  r2c0   |       |         |
M_CPU <--->+         |       |         |
           |  r3c0   |       |         |
SRAM <---->+         |       |         |
           |         |       |         |
PE4.dma ---+  r4c0   |  ...  |  r4c5   +--- PE6.dma
PE4.cpu <--+         |       |         +--< PE6.cpu
           |         |       |         |
           +---------+---+---+---------+
                         |
                 UCIe-S (conn x4)

HBM attach: hbm_ctrl is also attached to every router that hosts a PE (ADR-0019 D1)
(xbar_top/xbar_bot were removed by ADR-0019)
```

### D5. NOC edge bandwidths and distances

| Connection | BW (GB/s) | Distance | Notes |
| --- | --- | --- | --- |
| PE_DMA -> NOC | 256.0 | Physical (PE pos) | Matches HBM slice BW |
| NOC -> PE_CPU | - | 0.0 mm | Command path only |
| Router <-> HBM_CTRL | 256.0 | 0.0 mm | Per PE router (ADR-0019) |
| NOC <-> M_CPU | - | 0.0 mm | Command path |
| NOC <-> SRAM | 128.0 x4 | 0.0 mm | 512 GB/s aggregate |
| NOC <-> UCIe conn | 128.0 | 0.0 mm | Per connection, 4 per port |

Distance 0.0 mm for most connections reflects the distributed nature of
the NOC; the actual traversal distance is computed internally via Manhattan
distance within the router grid.

### D6. UCIe decomposition and inter-cube traffic

Each cube has 4 UCIe ports (N, S, E, W). Each port is decomposed into:

- 1 `ucie-{PORT}` node: UCIe protocol endpoint (overhead = 8.0 ns)
- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe

This decomposition enables N=4 independent NOC-to-UCIe connections per port,
each with 128 GB/s bandwidth. Total aggregate per port: 512 GB/s.

Inter-cube traffic path:

```text
Source: PE_DMA -> NOC -> conn{i} -> ucie-{PORT}
            [UCIe link: 512 GB/s, 1.0mm seam distance]
Target: ucie-{PORT} -> conn{i} -> r{x}c{y} -> (mesh hops) -> hbm_ctrl
```

UCIe overhead (8.0 ns) is applied at each `ucie-{PORT}` node, so a
full crossing incurs 16 ns (TX port + RX port).
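
The arithmetic behind these figures is small enough to write out. The constants below come from this section. Treating the seam as an extra `distance_mm` x `ns_per_mm` term is an assumption carried over from D1, and the function names are hypothetical.

```python
UCIE_OVERHEAD_NS = 8.0   # per ucie-{PORT} node (D6)
CONNS_PER_PORT = 4       # conn0..conn3 (D6)
CONN_BW_GBPS = 128.0     # per connection (D5)
SEAM_MM = 1.0            # UCIe link seam distance (D6)


def port_aggregate_bw_gbps() -> float:
    """Aggregate NOC-to-UCIe bandwidth of one port."""
    return CONNS_PER_PORT * CONN_BW_GBPS


def crossing_latency_ns(ns_per_mm: float) -> float:
    """Fixed cost of one cube-to-cube crossing, excluding mesh hops.

    Two protocol endpoints (TX and RX) plus the seam wire delay, where the
    seam term is assumed to follow the same distance x ns_per_mm rule.
    """
    return 2 * UCIE_OVERHEAD_NS + SEAM_MM * ns_per_mm


print(port_aggregate_bw_gbps())         # 512.0
print(crossing_latency_ns(0.0))         # 16.0, the endpoint overhead alone
```

Mesh hops on either side of the seam, and any queueing from D3, add to this fixed cost.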

### D7. Data paths through the NOC

**PE DMA to local HBM (same half):**

```text
PE_DMA -> r{x}c{y} -> hbm_ctrl (local: 0 mesh hops, switching overhead only)
```

**PE DMA to remote PE's HBM:**

```text
PE_DMA -> r{x}c{y} -> (mesh hops) -> r{x'}c{y'} -> hbm_ctrl
```

**PE DMA to remote cube HBM:**

```text
PE_DMA -> r{x}c{y} -> conn -> ucie-E -> [seam] -> ucie-W -> conn -> r{x'}c{y'} -> hbm_ctrl
```

**Kernel Launch command to PE:**

```text
[from io_noc] -> ucie -> conn -> r{x}c{y} -> (mesh hops) -> M_CPU -> (mesh hops) -> PE_CPU
```

**Shared SRAM access:**

```text
PE_DMA -> r{x}c{y} -> (mesh hops) -> SRAM
```

### D8. Mesh generation

The router grid is generated by `mesh_gen.py` based on:

- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner
- `cube.geometry`: cube physical dimensions and HBM zone
- `cube.ucie.n_connections`: determines router count for UCIe attachment

The generator produces a `mesh_data` dictionary containing:

- Router grid with positions and HBM exclusion zones
- PE-to-router attachments (pe_dma, pe_cpu per PE)
- UCIe-to-router attachments (N/S/E/W, distributed across edge routers)
- M_CPU and SRAM router attachments
- HBM attachment per PE router (ADR-0019)
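
To make the list above concrete, here is one possible shape for such a dictionary, together with a small consistency check. Every key name, the partial set of attachments, and the `check_mesh` helper are hypothetical; the actual `mesh_data` layout produced by `mesh_gen.py` is not specified in this ADR and may differ.

```python
# Hypothetical shape of `mesh_data`; all key names are assumptions.
mesh_data = {
    "grid": {"rows": 6, "cols": 6},
    "hbm_zone": {"r2c2", "r2c3", "r3c2", "r3c3"},
    "pe_attach": {
        "pe0": {"pe_dma": "r0c0", "pe_cpu": "r0c0"},
        "pe2": {"pe_dma": "r0c5", "pe_cpu": "r0c5"},
    },
    "ucie_attach": {"N": ["r0c1", "r0c2", "r0c3", "r0c4"]},
    "m_cpu": "r2c0",
    "sram": "r3c0",
    "hbm_attach": ["r0c0", "r0c5"],   # one entry per PE router
}


def check_mesh(data: dict) -> list[str]:
    """Return a list of problems found; an empty list means consistent."""
    problems = []
    pe_routers = {r for pe in data["pe_attach"].values() for r in pe.values()}
    used = set(pe_routers) | {data["m_cpu"], data["sram"]}
    for routers in data["ucie_attach"].values():
        used.update(routers)
    # No component may attach to a router inside the HBM exclusion zone.
    for router in sorted(used & data["hbm_zone"]):
        problems.append(f"{router} is inside the HBM exclusion zone")
    # HBM_CTRL attaches at exactly the routers that host a PE.
    if set(data["hbm_attach"]) != pe_routers:
        problems.append("hbm_attach does not match the set of PE routers")
    return problems


print(check_mesh(mesh_data))  # []
```

A check of this kind is cheap to run after generation and catches attachments that land inside the excluded center cells.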

## Consequences

- NOC provides position-aware routing with deterministic latency
- Contention is captured per directed segment (not per-node)
- All cube-internal traffic is explicitly routed through the NOC
- HBM exclusion zone reflects physical die layout constraints
- The mesh generation is fully parameterized by `topology.yaml`

## Links

- ADR-0003 D3 (cube-level NOC definition, extended by this ADR)
- ADR-0004 D1 (PE DMA to local HBM path via router mesh)
- ADR-0014 D1 (PE_DMA egress via router mesh)
- ADR-0019 (NOC-Local HBM: xbar/bridge removed, explicit router mesh)
- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
- ADR-0016 D1 (IOChiplet io_noc: analogous pattern at IO chiplet level)