687c98086d
Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
(dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
retroactive docs pending verification.
Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
deleted; ADR-0019/0021 moved to adr-history with one-line stub status
Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
selection, flit-aware per-flit commit, async finalize, command-only
fallback path)
Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
"Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
(now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)
Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py
Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.
Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
(ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
292 lines
11 KiB
Markdown
# ADR-0017: Cube NOC and HBM Connectivity

## Status

Accepted

## Context

The CUBE-level NOC is a 2D router mesh that carries every request that
traverses a cube: PE-to-HBM data, PE-to-PE traffic, command paths
(M_CPU↔PE_CPU), shared SRAM access, and inter-cube UCIe traffic.

The CUBE's HBM is exposed through per-PE controller endpoints attached
to PE routers. This per-PE partitioning makes local-vs-remote HBM
distinguishable by mesh distance: a PE's own HBM partition sits at its
own router (switching overhead only); another PE's HBM partition is
reachable by mesh hops to that PE's router.

Two channel-mapping modes are supported in the design space:

- **n:1 (default, implemented)**: each PE's HBM partition aggregates
  `channels_per_pe` pseudo-channels into one endpoint. Effective
  per-PE BW = N × per-channel BW, where N = `channels_per_pe`.
- **1:1 (future)**: each PE router decomposes into per-channel
  mini-routers; per-channel BW contention is modeled directly.

In both modes the per-PE effective BW is identical; only the connectivity
granularity differs.

## Decision

### D1. 2D router mesh

Each cube contains a 2D mesh of NOC routers generated by `mesh_gen.py`.

- Node naming: `sip{S}.cube{C}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`).
- Implementation: `forwarding_v1`. NOC `overhead_ns = 0`.
- Default 6×6 grid (sized from PE corner placement + UCIe attachment
  count); larger PE counts scale the grid up.
- HBM exclusion zone: center rows/columns are excluded where HBM die
  physically occupies space (e.g., r2c2, r2c3, r3c2, r3c3 for a 6×6).
- Latency = Manhattan distance × `ns_per_mm`.

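As a rough illustration of the naming and latency rules above, the sketch below formats a router name and computes the uncontended latency between two routers. The helper names, the 2.0 mm router pitch, and `ns_per_mm = 0.25` are assumptions chosen for the example; they are not taken from `mesh_gen.py` or the topology config.

```python
def router_name(sip: int, cube: int, row: int, col: int) -> str:
    """Format a node name following the sip{S}.cube{C}.r{row}c{col} scheme."""
    return f"sip{sip}.cube{cube}.r{row}c{col}"


def router_pos_mm(row: int, col: int, pitch_mm: float) -> tuple:
    """Physical (x, y) position of a router, assuming a uniform pitch (hypothetical)."""
    return (col * pitch_mm, row * pitch_mm)


def manhattan_mm(a: tuple, b: tuple) -> float:
    """Manhattan distance between two (x, y) positions in millimetres."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def noc_latency_ns(a: tuple, b: tuple, ns_per_mm: float) -> float:
    """Uncontended NOC latency: Manhattan distance x ns_per_mm."""
    return manhattan_mm(a, b) * ns_per_mm


PITCH_MM = 2.0    # assumed router pitch, for illustration only
NS_PER_MM = 0.25  # assumed wire delay, for illustration only

src = router_pos_mm(0, 0, PITCH_MM)  # r0c0
dst = router_pos_mm(1, 4, PITCH_MM)  # r1c4
latency = noc_latency_ns(src, dst, NS_PER_MM)
print(router_name(0, 0, 0, 0), "->", router_name(0, 0, 1, 4), latency, "ns")
```

With these assumed values the two routers are 5 hops (10.0 mm) apart, so the uncontended latency is 2.5 ns.
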
### D2. XY routing algorithm

Deterministic XY routing:

1. Horizontal segment: route from source X to destination X at source Y.
2. Vertical segment: route from source Y to destination Y at destination X.

Each directed segment carries a unique key:

- Horizontal: `("H", y_band, x_min, x_max, direction)`
- Vertical: `("V", x_band, y_min, y_max, direction)`

Grid positions are snapped to the router grid, excluding the HBM zone.

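The two routing steps and the segment keys can be sketched as follows. This is a minimal illustration, assuming that X maps to the router column and Y to the router row, and that `direction` is encoded as `"+"` or `"-"`; neither encoding is confirmed by this ADR, and the sketch does not snap positions around the HBM exclusion zone.

```python
def xy_segments(src, dst):
    """Return the directed segment keys for an XY route.

    src and dst are (x, y) grid coordinates. The "+"/"-" direction
    encoding is an assumption made for this example.
    """
    (sx, sy), (dx, dy) = src, dst
    segments = []
    if sx != dx:  # horizontal leg first, at the source Y
        segments.append(("H", sy, min(sx, dx), max(sx, dx), "+" if dx > sx else "-"))
    if sy != dy:  # then the vertical leg, at the destination X
        segments.append(("V", dx, min(sy, dy), max(sy, dy), "+" if dy > sy else "-"))
    return segments


def xy_path(src, dst):
    """Return the ordered router names (r{row}c{col}) visited by the XY route."""
    (sx, sy), (dx, dy) = src, dst
    step_x = 1 if dx >= sx else -1
    step_y = 1 if dy >= sy else -1
    hops = [(x, sy) for x in range(sx, dx + step_x, step_x)]
    hops += [(dx, y) for y in range(sy + step_y, dy + step_y, step_y)]
    return [f"r{y}c{x}" for x, y in hops]


# r0c0 (x=0, y=0) to r1c4 (x=4, y=1)
print(xy_path((0, 0), (4, 1)))
print(xy_segments((0, 0), (4, 1)))
```

Routing from `r0c0` to `r1c4` yields one horizontal segment along row 0 followed by one vertical segment along column 4.
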
### D3. Per-segment contention model

Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
sharing a segment (same row or column band, same direction) contend for
the resource, which models link-level serialization in a wormhole-routed
mesh.

With no contention, NOC traversal latency equals Manhattan distance ×
`ns_per_mm`. Under contention, SimPy's resource scheduling adds queueing
delay.

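The queueing effect of a capacity-1 resource can be illustrated without SimPy. The sketch below is a simplified stand-in, not the simulator's implementation: it assumes each transaction holds exactly one segment for a fixed time and that waiters are served in arrival order. The function name and the tuple layout are hypothetical.

```python
from collections import defaultdict


def serialize_on_segments(transactions):
    """Serve transactions one at a time per directed segment, in arrival order.

    transactions: list of (arrival_ns, segment_key, hold_ns).
    Returns a list of (start_ns, finish_ns) in the same order as the input.
    """
    free_at = defaultdict(float)  # time at which each segment becomes free
    results = [None] * len(transactions)
    # Stable sort: ties on arrival time keep their input order.
    order = sorted(range(len(transactions)), key=lambda i: transactions[i][0])
    for i in order:
        arrival, key, hold = transactions[i]
        start = max(arrival, free_at[key])  # wait if the segment is busy
        free_at[key] = start + hold
        results[i] = (start, start + hold)
    return results


east = ("H", 0, 0, 4, "+")  # direction encoding is an assumption
west = ("H", 0, 0, 4, "-")
timeline = serialize_on_segments([
    (0.0, east, 2.5),  # first on the eastbound segment
    (0.0, east, 2.5),  # same segment and direction: must queue
    (0.0, west, 2.5),  # opposite direction: a separate resource
])
print(timeline)
```

The second eastbound transaction starts only once the first releases the segment, while the westbound one proceeds immediately because it uses a different directed segment.
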
### D4. NOC attachment points (per-PE HBM partition)

Every PE router carries three attachments: `pe{idx}.dma`, `pe{idx}.cpu`,
and `pe{idx}.hbm`. The last is the per-PE HBM controller endpoint,
`sip{S}.cube{C}.hbm_ctrl.pe{idx}`, which owns one slice of the cube's
HBM (one pseudo-channel group; see D8).

Other attachments:

- M_CPU and shared SRAM each occupy a dedicated edge router.
- UCIe endpoints (N/S/E/W) each expose 4 connection routers distributed
  along that edge (see D6).

```text
                 UCIe-N (conn x4)
                         |
           +---------+---+---+---------+
           |         |       |         |
PE0.dma ---+ r0c0    |  ...  | r0c5    +--- PE2.dma
PE0.cpu <--+ +hbm.pe0|       | +hbm.pe2+--< PE2.cpu
           |         |       |         |
UCIe-W ----+   ...   | [HBM] |   ...   +---- UCIe-E
(conn x4)  |         | zone  |         | (conn x4)
           | r2c0    |       |         |
M_CPU <--->+         |       |         |
           | r3c0    |       |         |
SRAM <---->+         |       |         |
           |         |       |         |
PE4.dma ---+ r4c0    |  ...  | r4c5    +--- PE6.dma
PE4.cpu <--+ +hbm.pe4|       | +hbm.pe6+--< PE6.cpu
           |         |       |         |
           +---------+---+---+---------+
                         |
                 UCIe-S (conn x4)
```

Per-PE HBM partitioning is the key invariant that makes local vs
cross-PE HBM distinguishable by mesh distance (see D7).

### D5. NOC edge bandwidths and distances

| Connection                    | BW (GB/s)  | Distance      | Notes                                       |
| ----------------------------- | ---------- | ------------- | ------------------------------------------- |
| PE_DMA → NOC                  | 256.0      | Physical (PE) | Matches local-HBM aggregate BW              |
| NOC → PE_CPU                  | n/a        | 0.0 mm        | Command path only                           |
| Router ↔ hbm_ctrl.pe{idx}     | 256.0      | 0.0 mm        | Per PE router; N × per-channel BW (see D8)  |
| NOC ↔ M_CPU                   | n/a        | 0.0 mm        | Command path                                |
| NOC ↔ SRAM                    | 128.0 × 4  | 0.0 mm        | 512 GB/s aggregate                          |
| NOC ↔ UCIe conn               | 128.0      | 0.0 mm        | Per connection; 4 conn per port             |

`0.0 mm` distances reflect the distributed nature of the NOC; actual
traversal distance is computed via Manhattan distance within the router
grid.

### D6. UCIe decomposition and inter-cube traffic

Each of the 4 UCIe ports (N, S, E, W) decomposes into:

- 1 `ucie-{PORT}` node: UCIe protocol endpoint (`overhead = 8.0 ns`).
- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe.

This decomposition gives 4 independent NOC↔UCIe connections per port,
each with 128 GB/s bandwidth (512 GB/s aggregate per port).

Inter-cube traffic path:

```text
Source:  PE_DMA → NOC → conn{i} → ucie-{PORT}
         [UCIe link: 512 GB/s, 1.0 mm seam distance]
Target:  ucie-{PORT} → conn{i} → r{x}c{y} → (mesh hops) → hbm_ctrl.pe{idx}
```

UCIe overhead (8.0 ns) is applied at each `ucie-{PORT}` node, so a full
crossing incurs 16 ns (TX port + RX port).

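The fixed cost of one UCIe crossing follows directly from the numbers above. The sketch below is only an arithmetic illustration: the constant names are hypothetical, `ns_per_mm = 0.25` is an assumed value, and charging the 1.0 mm seam at `ns_per_mm` is itself an assumption about how the seam distance is converted to time.

```python
UCIE_OVERHEAD_NS = 8.0  # applied at each ucie-{PORT} node
CONN_BW_GBS = 128.0     # per NOC<->UCIe connection
CONNS_PER_PORT = 4
SEAM_MM = 1.0           # seam distance of the UCIe link


def port_aggregate_bw_gbs() -> float:
    """Aggregate NOC<->UCIe bandwidth of one port."""
    return CONN_BW_GBS * CONNS_PER_PORT


def crossing_fixed_ns(ns_per_mm: float) -> float:
    """Protocol overhead at the TX and RX ports plus the seam wire delay.

    Mesh hops on either side of the seam are not included.
    """
    return 2 * UCIE_OVERHEAD_NS + SEAM_MM * ns_per_mm


print(port_aggregate_bw_gbs())   # 4 x 128 GB/s
print(crossing_fixed_ns(0.25))   # 16 ns of overhead plus the assumed seam delay
```
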
### D7. Data paths through the NOC

All intra-cube traffic uses the same router mesh; there are no separate
fast paths.

**Local HBM** (same PE's own partition; 0 mesh hops):

```text
PE_DMA → r{x}c{y} → hbm_ctrl.pe{idx}    (switching overhead only)
```

**Cross-PE HBM within cube** (target PE's partition, reached by mesh):

```text
PE_DMA → r{x}c{y} → (mesh hops) → r{x'}c{y'} → hbm_ctrl.pe{idx'}
```

Example: PE0 (on `r0c0`) accessing PE2's HBM (PE2 on `r1c4`):

```text
PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl.pe2
```

Dijkstra computes the shortest path within the mesh.

**Cross-cube HBM** (UCIe traversal):

```text
PE_DMA → r{x}c{y} → conn → ucie-{PORT} → [seam] → ucie-{PORT'} → conn
       → r{x'}c{y'} → hbm_ctrl.pe{idx'}
```

**Kernel launch command to PE**:

```text
[from io_noc] → ucie → conn → r{x}c{y} → (mesh) → M_CPU → (mesh) → PE_CPU
```

**Shared SRAM access**:

```text
PE_DMA → r{x}c{y} → (mesh) → SRAM
```

### D8. HBM channel mapping mode

Channel mapping is configured at cube scope:

```yaml
cube:
  memory_map:
    hbm_mapping_mode: n_to_one       # one_to_one | n_to_one
    hbm_pseudo_channels: 64          # total pseudo-channel count
    hbm_channels_per_pe: 8           # per-PE local channel count
    hbm_channel_bw_gbs: 32.0         # per-channel bandwidth (GB/s)
    hbm_slices_per_cube: 8           # number of per-PE partitions
    hbm_total_gb_per_cube: 48
```

**n:1 mode (default, implemented).** Each PE's HBM partition is a single
endpoint `hbm_ctrl.pe{idx}` that aggregates `channels_per_pe`
pseudo-channels. The `Router ↔ hbm_ctrl.pe{idx}` link bandwidth equals
`channels_per_pe × hbm_channel_bw_gbs`. Pseudo-channels are assumed to
interleave; only aggregate per-PE BW is modeled. No separate aggregated
router node exists; the per-PE router itself serves that role.

**1:1 mode (future).** Each PE router decomposes into N channel
mini-routers; per-channel routing carries fully-resolved PA + channel ID.
A `ChannelSplitter` resolves a logical access to N per-channel physical
requests. Each per-channel link models BW contention. Cross-PE channel
access semantics are deferred to the implementation ADR.

**BW math (defaults).**

| Parameter                          | Value                      |
| ---------------------------------- | -------------------------- |
| pseudo channels per cube           | 64 (parameter)             |
| PEs per cube                       | 8 (parameter)              |
| channels per PE (N)                | 64 / 8 = 8                 |
| per-channel BW                     | 32 GB/s (parameter)        |
| per-PE local BW                    | N × 32 = 256 GB/s          |
| cube total HBM BW                  | 64 × 32 = 2048 GB/s        |

Both modes give the same per-PE effective BW; only the request shape and
contention model differ.

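The BW math above can be reproduced from the `memory_map` defaults. This is a small sketch for checking the arithmetic; the `derive_hbm_bw` helper is hypothetical, and the consistency check between `hbm_channels_per_pe` and the pseudo-channel count is an assumption about how the parameters relate, not documented validation logic.

```python
memory_map = {
    "hbm_mapping_mode": "n_to_one",
    "hbm_pseudo_channels": 64,
    "hbm_channels_per_pe": 8,
    "hbm_channel_bw_gbs": 32.0,
    "hbm_slices_per_cube": 8,
    "hbm_total_gb_per_cube": 48,
}


def derive_hbm_bw(mm: dict) -> dict:
    """Derive per-PE and cube-level HBM bandwidth from memory_map parameters."""
    channels_per_pe = mm["hbm_pseudo_channels"] // mm["hbm_slices_per_cube"]
    # Assumed invariant: the configured per-PE channel count matches the split.
    if channels_per_pe != mm["hbm_channels_per_pe"]:
        raise ValueError("hbm_channels_per_pe does not match pseudo_channels / slices")
    return {
        "channels_per_pe": channels_per_pe,
        "per_pe_bw_gbs": channels_per_pe * mm["hbm_channel_bw_gbs"],
        "cube_total_bw_gbs": mm["hbm_pseudo_channels"] * mm["hbm_channel_bw_gbs"],
    }


bw = derive_hbm_bw(memory_map)
print(bw)
```

With the defaults this gives 8 channels per PE, 256 GB/s per PE, and 2048 GB/s per cube, matching the table.
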
### D9. AddressResolver: per-PE HBM endpoint

The address resolver decodes a PA's HBM offset to the owning PE's
partition:

```python
# policy/routing/router.py
# Excerpt: `addr`, `s`, and `d` are bound by the enclosing resolver (not shown).
hbm_slice_bytes = hbm_total_gb_per_cube * (1 << 30) // hbm_slices_per_cube

if addr.kind == "hbm":
    pe_id = int(addr.hbm_offset) // hbm_slice_bytes
    return f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
```

The `pe_id` computation is intrinsic to the routing layer (not a
topology-time concern). Any HBM PA falls within exactly one partition,
yielding deterministic routing.

External callers (e.g., M_CPU DMA, Memory R/W from PCIE_EP) follow the
same resolver path; there is no separate fast path.

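A standalone version of the slice arithmetic makes the mapping easy to check by hand. The function below is a hypothetical wrapper around the same formula, using the D8 defaults (48 GB per cube, 8 slices, so 6 GiB per partition); the bounds check is an addition for the example and is not claimed to exist in `router.py`.

```python
GIB = 1 << 30


def resolve_hbm_endpoint(sip: int, cube: int, hbm_offset: int,
                         total_gb_per_cube: int = 48,
                         slices_per_cube: int = 8) -> str:
    """Map an HBM offset within a cube to the owning per-PE controller endpoint."""
    slice_bytes = total_gb_per_cube * GIB // slices_per_cube
    if not 0 <= hbm_offset < slice_bytes * slices_per_cube:
        # Added for illustration: reject offsets outside the cube's HBM.
        raise ValueError(f"hbm_offset {hbm_offset} is outside the cube's HBM range")
    pe_id = hbm_offset // slice_bytes
    return f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe_id}"


print(resolve_hbm_endpoint(0, 0, 0))              # first byte of partition 0
print(resolve_hbm_endpoint(0, 0, 13 * GIB))       # 13 GiB // 6 GiB = partition 2
print(resolve_hbm_endpoint(0, 0, 48 * GIB - 1))   # last byte of partition 7
```
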
### D10. Mesh generation parameters

`mesh_gen.py` produces `cube_mesh.yaml` from:

- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner.
- `cube.geometry`: cube physical dimensions and HBM zone.
- `cube.ucie.n_connections`: determines router count for UCIe attachment.

The output `mesh_data` dictionary contains:

- Router grid with positions and HBM exclusion zones.
- PE-to-router attachments (`pe{idx}.dma`, `pe{idx}.cpu`, `pe{idx}.hbm`
  per PE).
- UCIe-to-router attachments (N/S/E/W distributed across edge routers).
- M_CPU and SRAM router attachments.

## Consequences

- Local HBM (0 mesh hops, switching overhead only) and cross-PE HBM
  (mesh hops) are naturally distinguishable, satisfying SPEC R5
  (multi-domain communication) and ADR-0002 (no zero-latency end-to-end
  paths).
- All cube-internal traffic routes through one mesh: a single contention
  model, a single layout, and a single set of edge BWs.
- Per-PE HBM partitioning maps cleanly to the LA model (ADR-0011): each
  PE's partition is the n:1 aggregate of its assigned pseudo-channels.
- 1:1 mode extension is structurally natural: split each PE router into
  N channel routers.
- Mesh generation is fully parameterised by `topology.yaml`; PE/cube
  geometry changes propagate without code edits.

## Links

- ADR-0002 (Routing distance, ordering, no zero-latency paths)
- ADR-0003 D3 (cube-level NOC definition, extended here)
- ADR-0004 (Memory semantics, local HBM)
- ADR-0011 (Memory addressing: LA model consumes per-PE partition)
- ADR-0014 D1 (PE_DMA egress via router mesh)
- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
- ADR-0016 (IOChiplet io_noc, an analogous pattern at IO chiplet level)
- ADR-0033 (Latency model: per-PC parallelism, switch penalty)