kernbench2/docs/adr/ADR-0017-dev-cube-noc-and-hbm-connectivity.md
ywkang 687c98086d ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037
Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
  (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
  Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
  deleted; ADR-0019/0021 moved to adr-history with one-line stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
  block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
  ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:15:55 -07:00


ADR-0017: Cube NOC and HBM Connectivity

Status

Accepted

Context

The CUBE-level NOC is a 2D router mesh that carries every intra-cube request: PE-to-HBM data, PE-to-PE traffic, command paths (M_CPU↔PE_CPU), shared SRAM access, and inter-cube UCIe traffic.

The CUBE's HBM is exposed through per-PE controller endpoints attached to PE routers. This per-PE partitioning makes local-vs-remote HBM distinguishable by mesh distance: a PE's own HBM partition sits at its own router (switching overhead only); another PE's HBM partition is reachable by mesh hops to that PE's router.

Two channel-mapping modes are supported in the design space:

  • n:1 (default, implemented): each PE's HBM partition aggregates channels_per_pe pseudo-channels into one endpoint. Effective per-PE BW = N × per-channel BW, where N = channels_per_pe.
  • 1:1 (future): each PE router decomposes into per-channel mini-routers; per-channel BW contention is modeled directly.

In both modes the per-PE effective BW is identical; only the connectivity granularity differs.

Decision

D1. 2D router mesh

Each cube contains a 2D mesh of NOC routers generated by mesh_gen.py.

  • Node naming: sip{S}.cube{C}.r{row}c{col} (e.g., sip0.cube0.r0c0).
  • Implementation: forwarding_v1. NOC overhead_ns = 0.
  • Default 6×6 grid (sized from PE corner placement + UCIe attachment count); larger PE counts scale the grid up.
  • HBM exclusion zone: center rows/columns are excluded where HBM die physically occupies space (e.g., r2c2, r2c3, r3c2, r3c3 for a 6×6).
  • Latency = Manhattan distance × ns_per_mm.
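The naming, exclusion, and latency rules above can be sketched as follows. This is a minimal illustration, not the contents of mesh_gen.py: the function names (`hbm_zone`, `router_names`, `noc_latency_ns`) are hypothetical, the exclusion zone is assumed to be the central 2×2 block of an even-sized grid (which reproduces the r2c2/r2c3/r3c2/r3c3 example for 6×6), and router positions in mm are taken as inputs.

```python
from typing import List, Set, Tuple


def hbm_zone(rows: int, cols: int) -> Set[Tuple[int, int]]:
    """Assumed rule: the central 2x2 block of the grid is occupied by the HBM die."""
    r0, c0 = rows // 2 - 1, cols // 2 - 1
    return {(r, c) for r in (r0, r0 + 1) for c in (c0, c0 + 1)}


def router_names(sip: int, cube: int, rows: int = 6, cols: int = 6) -> List[str]:
    """Router node names in sip{S}.cube{C}.r{row}c{col} form, skipping the HBM zone."""
    excluded = hbm_zone(rows, cols)
    return [
        f"sip{sip}.cube{cube}.r{r}c{c}"
        for r in range(rows)
        for c in range(cols)
        if (r, c) not in excluded
    ]


def noc_latency_ns(src_mm: Tuple[float, float],
                   dst_mm: Tuple[float, float],
                   ns_per_mm: float) -> float:
    """Uncontended latency: Manhattan distance (mm) times ns_per_mm."""
    distance_mm = abs(src_mm[0] - dst_mm[0]) + abs(src_mm[1] - dst_mm[1])
    return distance_mm * ns_per_mm
```

With the default 6×6 grid this yields 32 routers (36 positions minus the 4 excluded ones). The `ns_per_mm` value is a parameter here; no default is implied.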

D2. XY routing algorithm

Deterministic XY routing:

  1. Horizontal segment: travel along the source row (source Y) from source X to destination X.
  2. Vertical segment: travel along the destination column (destination X) from source Y to destination Y.

Each directed segment carries a unique key:

  • Horizontal: ("H", y_band, x_min, x_max, direction)
  • Vertical: ("V", x_band, y_min, y_max, direction)

Grid positions are snapped to the router grid, excluding the HBM zone.
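A small sketch of how the two segment keys could be derived for one transaction. It assumes the coordinates are already snapped to the router grid, that Y grows southward, and that the direction field uses the labels "E", "W", "N", "S"; those labels and the function name `xy_segments` are illustrative choices, not taken from the implementation.

```python
from typing import List, Tuple

# (kind, band, min, max, direction), matching the key layout described above.
Segment = Tuple[str, int, int, int, str]


def xy_segments(src: Tuple[int, int], dst: Tuple[int, int]) -> List[Segment]:
    """Return the directed XY segment keys for a route from src to dst.

    src and dst are (x, y) router-grid coordinates (already snapped).
    """
    (sx, sy), (dx, dy) = src, dst
    segments: List[Segment] = []
    if sx != dx:
        # Horizontal leg runs in the source row (y_band = source Y).
        direction = "E" if dx > sx else "W"
        segments.append(("H", sy, min(sx, dx), max(sx, dx), direction))
    if sy != dy:
        # Vertical leg runs in the destination column (x_band = destination X).
        direction = "S" if dy > sy else "N"
        segments.append(("V", dx, min(sy, dy), max(sy, dy), direction))
    return segments
```

For example, `xy_segments((0, 0), (4, 1))` gives one horizontal key in row band 0 followed by one vertical key in column band 4. A route whose source and destination coincide produces no segments.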

D3. Per-segment contention model

Each directed XY segment is a simpy.Resource(capacity=1). Transactions sharing a segment (same row or column band, same direction) contend for the resource — modelling link-level serialization in a wormhole-routed mesh.

With no contention, NOC traversal latency equals Manhattan distance × ns_per_mm. Under contention, SimPy's resource scheduling adds queueing delay.
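To make the queueing effect concrete, here is a plain-Python stand-in for what a capacity-1 resource with first-come-first-served grants would produce on a single segment. It deliberately does not use SimPy, so it only approximates the behavior described above; the function name and the (arrival, hold) request shape are assumptions for illustration.

```python
from typing import List, Tuple


def serialize_on_segment(requests: List[Tuple[float, float]]) -> List[float]:
    """Finish times for requests sharing one capacity-1 segment.

    Each request is (arrival_ns, hold_ns). Requests are served in arrival
    order; a request that arrives while the segment is busy waits until
    the segment is released. Results are returned in arrival order.
    """
    finish_times: List[float] = []
    free_at = 0.0  # time at which the segment next becomes available
    for arrival_ns, hold_ns in sorted(requests):
        start = max(arrival_ns, free_at)  # queueing delay = start - arrival_ns
        free_at = start + hold_ns
        finish_times.append(free_at)
    return finish_times
```

With requests arriving at 0, 1, and 10 ns that each hold the segment for 4 ns, the second request waits 3 ns behind the first, while the third finds the segment idle and sees no queueing delay.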

D4. NOC attachment points (per-PE HBM partition)

Every PE router carries three attachments: pe{idx}.dma, pe{idx}.cpu, and pe{idx}.hbm. The last is the per-PE HBM controller endpoint — sip{S}.cube{C}.hbm_ctrl.pe{idx} — which owns one slice of the cube's HBM (one pseudo-channel group; see D8).

Other attachments:

  • M_CPU and shared SRAM each occupy a dedicated edge router.
  • UCIe endpoints (N/S/E/W) each expose 4 connection routers distributed along that edge (see D6).

                    UCIe-N (conn x4)
                         |
           +---------+---+---+---------+
           |         |       |         |
PE0.dma ---+  r0c0   |  ...  |  r0c5  +--- PE2.dma
PE0.cpu <--+ +hbm.pe0|       | +hbm.pe2+--< PE2.cpu
           |         |       |         |
UCIe-W ----+  ...    | [HBM] |  ...   +---- UCIe-E
(conn x4)  |         | zone  |         |  (conn x4)
           |  r2c0   |       |         |
M_CPU <--->+         |       |         |
           |  r3c0   |       |         |
SRAM <---->+         |       |         |
           |         |       |         |
PE4.dma ---+  r4c0   |  ...  |  r4c5  +--- PE6.dma
PE4.cpu <--+ +hbm.pe4|       | +hbm.pe6+--< PE6.cpu
           |         |       |         |
           +---------+---+---+---------+
                         |
                    UCIe-S (conn x4)

Per-PE HBM partitioning is the key invariant that makes local vs cross-PE HBM distinguishable by mesh distance (see D7).

D5. NOC edge bandwidths and distances

| Connection | BW (GB/s) | Distance | Notes |
|---|---|---|---|
| PE_DMA → NOC | 256.0 | Physical (PE) | Matches local-HBM aggregate BW |
| NOC → PE_CPU | | 0.0 mm | Command path only |
| Router ↔ hbm_ctrl.pe{idx} | 256.0 | 0.0 mm | Per PE router; N × per-channel BW (see D8) |
| NOC ↔ M_CPU | | 0.0 mm | Command path |
| NOC ↔ SRAM | 128.0 × 4 | 0.0 mm | 512 GB/s aggregate |
| NOC ↔ UCIe conn | 128.0 | 0.0 mm | Per connection; 4 conn per port |

0.0 mm distances reflect the distributed nature of the NOC; actual traversal distance is computed via Manhattan distance within the router grid.

D6. UCIe decomposition and inter-cube traffic

Each of the 4 UCIe ports (N, S, E, W) decomposes into:

  • 1 ucie-{PORT} node: UCIe protocol endpoint (overhead = 8.0 ns).
  • 4 ucie-{PORT}.conn{0-3} nodes: connection bridges between NOC and UCIe.

This decomposition gives 4 independent NOC↔UCIe connections per port, each with 128 GB/s bandwidth (512 GB/s aggregate per port).

Inter-cube traffic path:

Source: PE_DMA → NOC → conn{i} → ucie-{PORT}
                  [UCIe link: 512 GB/s, 1.0mm seam distance]
Target: ucie-{PORT} → conn{i} → r{x}c{y} → (mesh hops) → hbm_ctrl.pe{idx}

UCIe overhead (8.0 ns) is applied at each ucie-{PORT} node, so a full crossing incurs 16 ns (TX port + RX port).
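The decomposition and the numbers quoted above can be checked with a short sketch. The constants come from this section; the `prefix` argument (for example "sip0.cube0") and the helper names are assumptions, since the text gives the UCIe node names without a cube prefix.

```python
from typing import List

UCIE_PORTS = ("N", "S", "E", "W")
CONNS_PER_PORT = 4          # ucie-{PORT}.conn{0-3}
CONN_BW_GBS = 128.0         # per NOC<->UCIe connection
UCIE_OVERHEAD_NS = 8.0      # applied at each ucie-{PORT} node


def ucie_nodes(prefix: str) -> List[str]:
    """One protocol endpoint plus four connection bridges per port."""
    nodes: List[str] = []
    for port in UCIE_PORTS:
        nodes.append(f"{prefix}.ucie-{port}")
        nodes.extend(
            f"{prefix}.ucie-{port}.conn{i}" for i in range(CONNS_PER_PORT)
        )
    return nodes


def port_aggregate_bw_gbs() -> float:
    """Aggregate NOC<->UCIe bandwidth of one port."""
    return CONNS_PER_PORT * CONN_BW_GBS


def crossing_overhead_ns() -> float:
    """Protocol overhead of one inter-cube crossing (TX port + RX port)."""
    return 2 * UCIE_OVERHEAD_NS
```

Each cube therefore carries 20 UCIe-related nodes (4 endpoints and 16 connection bridges), 512 GB/s of aggregate connection bandwidth per port, and 16 ns of protocol overhead per crossing, before mesh and seam distance are added.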

D7. Data paths through the NOC

All intra-cube traffic uses the same router mesh — no separate fast paths.

Local HBM (same PE's own partition; 0 mesh hops):

PE_DMA → r{x}c{y} → hbm_ctrl.pe{idx}   (switching overhead only)

Cross-PE HBM within cube (target PE's partition, reached by mesh):

PE_DMA → r{x}c{y} → (mesh hops) → r{x'}c{y'} → hbm_ctrl.pe{idx'}

Example: PE0 (on r0c0) accessing PE2's HBM (PE2 on r1c4):

PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl.pe2

Dijkstra computes the shortest path within the mesh.
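The hop sequence in the example can be reproduced with a simple column-then-row walk, sketched below. This is not the Dijkstra search mentioned above: it uses the XY order from D2, which coincides with a shortest path on an unobstructed mesh, and it ignores the HBM exclusion zone. The function name `xy_hops` is hypothetical.

```python
from typing import List, Tuple


def xy_hops(src: Tuple[int, int], dst: Tuple[int, int]) -> List[str]:
    """Router names visited from src to dst, both given as (row, col).

    Walks along the source row to the destination column first, then
    along that column to the destination row.
    """
    (row, col), (dst_row, dst_col) = src, dst
    hops = [f"r{row}c{col}"]
    while col != dst_col:
        col += 1 if dst_col > col else -1
        hops.append(f"r{row}c{col}")
    while row != dst_row:
        row += 1 if dst_row > row else -1
        hops.append(f"r{row}c{col}")
    return hops
```

Calling `xy_hops((0, 0), (1, 4))` returns the six routers from r0c0 through r0c4 to r1c4, matching the PE0-to-PE2 path listed above.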

Cross-cube HBM (UCIe traversal):

PE_DMA → r{x}c{y} → conn → ucie-{PORT} → [seam] → ucie-{PORT'} → conn
       → r{x'}c{y'} → hbm_ctrl.pe{idx'}

Kernel launch command to PE:

[from io_noc] → ucie → conn → r{x}c{y} → (mesh) → M_CPU → (mesh) → PE_CPU

Shared SRAM access:

PE_DMA → r{x}c{y} → (mesh) → SRAM

D8. HBM channel mapping mode

Channel mapping is configured at cube scope:

cube:
  memory_map:
    hbm_mapping_mode: n_to_one       # one_to_one | n_to_one
    hbm_pseudo_channels: 64          # total pseudo-channel count
    hbm_channels_per_pe: 8           # per-PE local channel count
    hbm_channel_bw_gbs: 32.0         # per-channel bandwidth (GB/s)
    hbm_slices_per_cube: 8           # number of per-PE partitions
    hbm_total_gb_per_cube: 48

n:1 mode (default, implemented). Each PE's HBM partition is a single endpoint hbm_ctrl.pe{idx} that aggregates channels_per_pe pseudo-channels. The Router ↔ hbm_ctrl.pe{idx} link bandwidth equals channels_per_pe × hbm_channel_bw_gbs. Pseudo-channels are assumed to interleave; only aggregate per-PE BW is modeled. No separate aggregated router node exists; the per-PE router itself serves that role.

1:1 mode (future). Each PE router decomposes into N channel mini-routers; per-channel routing carries fully-resolved PA + channel ID. A ChannelSplitter resolves a logical access to N per-channel physical requests. Per-channel link models BW contention. Cross-PE channel access semantics are deferred to the implementation ADR.

BW math (defaults).

| Parameter | Value |
|---|---|
| pseudo channels per cube | 64 (parameter) |
| PEs per cube | 8 (parameter) |
| channels per PE (N) | 64 / 8 = 8 |
| per-channel BW | 32 GB/s (parameter) |
| per-PE local BW | N × 32 = 256 GB/s |
| cube total HBM BW | 64 × 32 = 2048 GB/s |

Both modes give the same per-PE effective BW; only the request shape and contention model differ.
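The default BW math can be reproduced with a few lines. The function name and the returned dictionary keys are illustrative only; the default values mirror the parameters listed above.

```python
from typing import Dict


def hbm_bw(pseudo_channels: int = 64,
           pes_per_cube: int = 8,
           channel_bw_gbs: float = 32.0) -> Dict[str, float]:
    """Derive per-PE and per-cube HBM bandwidth from the cube parameters."""
    if pseudo_channels % pes_per_cube != 0:
        raise ValueError("pseudo-channels must divide evenly across PEs")
    channels_per_pe = pseudo_channels // pes_per_cube  # N
    return {
        "channels_per_pe": channels_per_pe,
        "per_pe_bw_gbs": channels_per_pe * channel_bw_gbs,
        "cube_bw_gbs": pseudo_channels * channel_bw_gbs,
    }
```

With the defaults this gives N = 8, 256 GB/s per PE, and 2048 GB/s per cube, which also matches the 256.0 GB/s Router ↔ hbm_ctrl.pe{idx} link in D5.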

D9. AddressResolver — per-PE HBM endpoint

The address resolver decodes a PA's HBM offset to the owning PE's partition:

# policy/routing/router.py
hbm_slice_bytes = hbm_total_gb_per_cube * (1 << 30) // hbm_slices_per_cube

if addr.kind == "hbm":
    pe_id = int(addr.hbm_offset) // hbm_slice_bytes
    return f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"

The pe_id computation is intrinsic to the routing layer (not a topology-time concern). Any HBM PA falls within exactly one partition, yielding deterministic routing.

External callers (e.g., M_CPU DMA, Memory R/W from PCIE_EP) follow the same resolver path — there is no separate fast path.
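A self-contained version of the resolver logic, for experimenting with partition boundaries. The `HbmAddr` dataclass and the `resolve_hbm_endpoint` name are hypothetical stand-ins for whatever address type policy/routing/router.py actually uses, the range check is an addition for illustration, and the defaults are taken from the D8 configuration (48 GB split into 8 slices, that is 6 GiB per PE).

```python
from dataclasses import dataclass

GIB = 1 << 30


@dataclass(frozen=True)
class HbmAddr:
    """Hypothetical decoded physical address that targets HBM."""
    sip: int
    cube: int
    hbm_offset: int  # byte offset within the cube's HBM


def resolve_hbm_endpoint(addr: HbmAddr,
                         hbm_total_gb_per_cube: int = 48,
                         hbm_slices_per_cube: int = 8) -> str:
    """Map an HBM offset to the owning PE's controller endpoint."""
    slice_bytes = hbm_total_gb_per_cube * GIB // hbm_slices_per_cube
    pe_id = addr.hbm_offset // slice_bytes
    if not 0 <= pe_id < hbm_slices_per_cube:
        # Added guard: offsets outside the cube's HBM have no owning slice.
        raise ValueError(f"HBM offset {addr.hbm_offset} is out of range")
    return f"sip{addr.sip}.cube{addr.cube}.hbm_ctrl.pe{pe_id}"
```

The last byte of the first 6 GiB slice resolves to pe0 and the very next byte resolves to pe1, which illustrates the "exactly one partition" property stated above.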

D10. Mesh generation parameters

mesh_gen.py produces cube_mesh.yaml from:

  • cube.pe_layout: corner placement (NW, NE, SW, SE) and PEs per corner.
  • cube.geometry: cube physical dimensions and HBM zone.
  • cube.ucie.n_connections: determines router count for UCIe attachment.

Output mesh_data dictionary contains:

  • Router grid with positions and HBM exclusion zones.
  • PE-to-router attachments (pe{idx}.dma, pe{idx}.cpu, pe{idx}.hbm per PE).
  • UCIe-to-router attachments (N/S/E/W distributed across edge routers).
  • M_CPU and SRAM router attachments.

Consequences

  • Local HBM (0 mesh hops, switching overhead only) and cross-PE HBM (mesh hops) are naturally distinguishable, satisfying SPEC R5 (multi-domain communication) and ADR-0002 (no zero-latency end-to-end paths).
  • All cube-internal traffic routes through one mesh — single contention model, single layout, single set of edge BWs.
  • Per-PE HBM partitioning maps cleanly to the LA model (ADR-0011): each PE's partition is the n:1 aggregate of its assigned pseudo-channels.
  • 1:1 mode extension is structurally natural — split each PE router into N channel routers.
  • Mesh generation is fully parameterised by topology.yaml; PE/cube geometry changes propagate without code edits.

Related ADRs

  • ADR-0002 (Routing distance, ordering, no zero-latency paths)
  • ADR-0003 D3 (cube-level NOC definition — extended here)
  • ADR-0004 (Memory semantics, local HBM)
  • ADR-0011 (Memory addressing — LA model consumes per-PE partition)
  • ADR-0014 D1 (PE_DMA egress via router mesh)
  • ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
  • ADR-0016 (IOChiplet io_noc — analogous pattern at IO chiplet level)
  • ADR-0033 (Latency model: per-PC parallelism, switch penalty)