ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037

Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
  (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization Notes
  (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
  deleted; ADR-0019/0021 moved to adr-history with one-line stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata block
  replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4); ADR-0027 cleaned
  of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml:
ADR-NNNN cross-references in docstrings and YAML comments updated after the
merges (ADR-0021->0014 D6, ADR-0019->0017 D8). No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:15:55 -07:00
parent 22fd0d2b9d
commit 687c98086d
97 changed files with 3286 additions and 3766 deletions
@@ -0,0 +1,291 @@
+# ADR-0017: Cube NOC and HBM Connectivity
+
+## Status
+
+Accepted
+
+## Context
+
+The CUBE-level NOC is a 2D router mesh that carries every intra-cube
+request: PE-to-HBM data, PE-to-PE traffic, command paths
+(M_CPU↔PE_CPU), shared SRAM access, and inter-cube UCIe traffic.
+
+The CUBE's HBM is exposed through per-PE controller endpoints attached
+to PE routers. This per-PE partitioning makes local-vs-remote HBM
+distinguishable by mesh distance: a PE's own HBM partition sits at its
+own router (switching overhead only); another PE's HBM partition is
+reachable by mesh hops to that PE's router.
+
+Two channel-mapping modes are supported in the design space:
+
+- **n:1 (default, implemented)** — each PE's HBM partition aggregates
+  `channels_per_pe` pseudo-channels into one endpoint. Effective
+  per-PE BW = N × per-channel BW, where N = `channels_per_pe`.
+- **1:1 (future)** — each PE router decomposes into per-channel
+  mini-routers; per-channel BW contention is modeled directly.
+
+In both modes the per-PE effective BW is identical; only the connectivity
+granularity differs.
+
+## Decision
+
+### D1. 2D router mesh
+
+Each cube contains a 2D mesh of NOC routers generated by `mesh_gen.py`.
+
+- Node naming: `sip{S}.cube{C}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`).
+- Implementation: `forwarding_v1`. NOC `overhead_ns = 0`.
+- Default 6×6 grid (sized from PE corner placement + UCIe attachment
+  count); larger PE counts scale the grid up.
+- HBM exclusion zone: center rows/columns are excluded where HBM die
+  physically occupies space (e.g., r2c2, r2c3, r3c2, r3c3 for a 6×6).
+- Latency = Manhattan distance × `ns_per_mm`.
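
The naming scheme and the 6×6 exclusion zone listed in D1 can be illustrated with a short sketch. The function name `router_names` and its parameters are hypothetical and are not taken from `mesh_gen.py`; the excluded cells are the four given as the 6×6 example above.

```python
def router_names(sip, cube, rows=6, cols=6,
                 hbm_zone=frozenset({(2, 2), (2, 3), (3, 2), (3, 3)})):
    """Return router node names for one cube, skipping the HBM exclusion zone.

    hbm_zone holds (row, col) pairs; the default is the 6x6 example from D1.
    """
    return [
        f"sip{sip}.cube{cube}.r{r}c{c}"
        for r in range(rows)
        for c in range(cols)
        if (r, c) not in hbm_zone
    ]


names = router_names(0, 0)
print(len(names), names[0], names[-1])
# 32 sip0.cube0.r0c0 sip0.cube0.r5c5
```

With the default arguments the 36-cell grid loses the four center cells, leaving 32 routers.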
+
+### D2. XY routing algorithm
+
+Deterministic XY routing:
+
+1. Horizontal segment: travel along the source row (source Y) from source X to destination X.
+2. Vertical segment: travel along the destination column (destination X) from source Y to destination Y.
+
+Each directed segment carries a unique key:
+
+- Horizontal: `("H", y_band, x_min, x_max, direction)`
+- Vertical:   `("V", x_band, y_min, y_max, direction)`
+
+Grid positions are snapped to the router grid, excluding the HBM zone.
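
As a rough illustration of the segment keys in D2, the sketch below derives the horizontal and vertical keys for one source/destination pair. The function name `xy_segments`, the `(x, y)` tuple convention, and the `"+"`/`"-"` direction encoding are assumptions made for this example; grid snapping and the HBM zone are left out.

```python
def xy_segments(src, dst):
    """Return the directed XY segment keys for a route from src to dst.

    src and dst are (x, y) grid positions, i.e. (column, row).
    The "+"/"-" direction labels are a hypothetical encoding.
    """
    sx, sy = src
    dx, dy = dst
    segments = []
    if sx != dx:
        # Horizontal leg runs along the source row.
        direction = "+" if dx > sx else "-"
        segments.append(("H", sy, min(sx, dx), max(sx, dx), direction))
    if sy != dy:
        # Vertical leg runs along the destination column.
        direction = "+" if dy > sy else "-"
        segments.append(("V", dx, min(sy, dy), max(sy, dy), direction))
    return segments


print(xy_segments((0, 0), (4, 1)))
# [('H', 0, 0, 4, '+'), ('V', 4, 0, 1, '+')]
```

Two routes that produce an identical key share a segment, which is what the contention model in D3 keys on.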
+
+### D3. Per-segment contention model
+
+Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
+sharing a segment (same row or column band, same direction) contend for
+the resource — modelling link-level serialization in a wormhole-routed
+mesh.
+
+With no contention, NOC traversal latency equals Manhattan distance ×
+`ns_per_mm`. Under contention, SimPy's resource scheduling adds queueing
+delay.
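
The queueing effect of a capacity-1 segment can be shown without SimPy. The sketch below is a plain-Python stand-in for FIFO serialization on a `simpy.Resource(capacity=1)`, not the simulator's code; `serialize` and the per-transaction `hold_ns` value are hypothetical.

```python
from collections import defaultdict


def serialize(requests):
    """Serialize transactions per segment, first come first served.

    requests: iterable of (arrival_ns, segment_key, hold_ns).
    Returns a list of (segment_key, arrival_ns, start_ns, finish_ns).
    """
    free_at = defaultdict(float)  # time at which each segment becomes idle
    results = []
    for arrival, key, hold in sorted(requests, key=lambda r: r[0]):
        start = max(arrival, free_at[key])  # wait if the segment is busy
        finish = start + hold
        free_at[key] = finish
        results.append((key, arrival, start, finish))
    return results


seg_a = ("H", 0, 0, 4, "+")
seg_b = ("V", 4, 0, 1, "+")
out = serialize([(0.0, seg_a, 10.0), (0.0, seg_a, 10.0), (0.0, seg_b, 10.0)])
print([start - arrival for _, arrival, start, _ in out])
# [0.0, 10.0, 0.0]
```

The second transaction on `seg_a` waits 10 ns behind the first, while the transaction on `seg_b` is not delayed because it uses a different segment.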
+
+### D4. NOC attachment points (per-PE HBM partition)
+
+Every PE router carries three attachments: `pe{idx}.dma`, `pe{idx}.cpu`,
+and `pe{idx}.hbm`. The last is the per-PE HBM controller endpoint —
+`sip{S}.cube{C}.hbm_ctrl.pe{idx}` — which owns one slice of the cube's
+HBM (one pseudo-channel group; see D8).
+
+Other attachments:
+
+- M_CPU and shared SRAM each occupy a dedicated edge router.
+- UCIe endpoints (N/S/E/W) each expose 4 connection routers distributed
+  along that edge (see D6).
+
+```text
+                    UCIe-N (conn x4)
+                         |
+           +---------+---+---+---------+
+           |         |       |         |
+PE0.dma ---+  r0c0   |  ...  |  r0c5  +--- PE2.dma
+PE0.cpu <--+ +hbm.pe0|       | +hbm.pe2+--< PE2.cpu
+           |         |       |         |
+UCIe-W ----+  ...    | [HBM] |  ...   +---- UCIe-E
+(conn x4)  |         | zone  |         |  (conn x4)
+           |  r2c0   |       |         |
+M_CPU <--->+         |       |         |
+           |  r3c0   |       |         |
+SRAM <---->+         |       |         |
+           |         |       |         |
+PE4.dma ---+  r4c0   |  ...  |  r4c5  +--- PE6.dma
+PE4.cpu <--+ +hbm.pe4|       | +hbm.pe6+--< PE6.cpu
+           |         |       |         |
+           +---------+---+---+---------+
+                         |
+                    UCIe-S (conn x4)
+```
+
+Per-PE HBM partitioning is the key invariant that makes local vs
+cross-PE HBM distinguishable by mesh distance (see D7).
+
+### D5. NOC edge bandwidths and distances
+
+| Connection                    | BW (GB/s)  | Distance      | Notes                                       |
+| ----------------------------- | ---------- | ------------- | ------------------------------------------- |
+| PE_DMA → NOC                  | 256.0      | Physical (PE) | Matches local-HBM aggregate BW              |
+| NOC → PE_CPU                  | —          | 0.0 mm        | Command path only                           |
+| Router ↔ hbm_ctrl.pe{idx}     | 256.0      | 0.0 mm        | Per PE router; N × per-channel BW (see D8)  |
+| NOC ↔ M_CPU                   | —          | 0.0 mm        | Command path                                |
+| NOC ↔ SRAM                    | 128.0 × 4  | 0.0 mm        | 512 GB/s aggregate                          |
+| NOC ↔ UCIe conn               | 128.0      | 0.0 mm        | Per connection; 4 conn per port             |
+
+`0.0 mm` distances reflect the distributed nature of the NOC; actual
+traversal distance is computed via Manhattan distance within the router
+grid.
+
+### D6. UCIe decomposition and inter-cube traffic
+
+Each of the 4 UCIe ports (N, S, E, W) decomposes into:
+
+- 1 `ucie-{PORT}` node: UCIe protocol endpoint (`overhead = 8.0 ns`).
+- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe.
+
+This decomposition gives 4 independent NOC↔UCIe connections per port,
+each with 128 GB/s bandwidth (512 GB/s aggregate per port).
+
+Inter-cube traffic path:
+
+```text
+Source: PE_DMA → NOC → conn{i} → ucie-{PORT}
+                  [UCIe link: 512 GB/s, 1.0 mm seam distance]
+Target: ucie-{PORT} → conn{i} → r{x}c{y} → (mesh hops) → hbm_ctrl.pe{idx}
+```
+
+UCIe overhead (8.0 ns) is applied at each `ucie-{PORT}` node, so a full
+crossing incurs 16 ns (TX port + RX port).
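
The D6 figures can be checked with a few lines of arithmetic. The constants below restate values from this section; the function and constant names are hypothetical.

```python
UCIE_PORT_OVERHEAD_NS = 8.0  # applied at each ucie-{PORT} node
CONN_BW_GBS = 128.0          # per NOC<->UCIe connection
CONNS_PER_PORT = 4


def conn_names(port):
    """Connection bridge node names for one UCIe port, e.g. 'N'."""
    return [f"ucie-{port}.conn{i}" for i in range(CONNS_PER_PORT)]


def crossing_overhead_ns(ports_traversed=2):
    """Protocol overhead for a crossing that passes a TX and an RX port."""
    return ports_traversed * UCIE_PORT_OVERHEAD_NS


def port_aggregate_bw_gbs():
    """Aggregate NOC<->UCIe bandwidth of one port."""
    return CONNS_PER_PORT * CONN_BW_GBS


print(conn_names("N"))
print(crossing_overhead_ns(), port_aggregate_bw_gbs())
# 16.0 512.0
```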
+
+### D7. Data paths through the NOC
+
+All intra-cube traffic uses the same router mesh — no separate fast
+paths.
+
+**Local HBM** (same PE's own partition; 0 mesh hops):
+
+```text
+PE_DMA → r{x}c{y} → hbm_ctrl.pe{idx}   (switching overhead only)
+```
+
+**Cross-PE HBM within cube** (target PE's partition, reached by mesh):
+
+```text
+PE_DMA → r{x}c{y} → (mesh hops) → r{x'}c{y'} → hbm_ctrl.pe{idx'}
+```
+
+Example: PE0 (on `r0c0`) accessing PE2's HBM (PE2 on `r1c4`):
+
+```text
+PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl.pe2
+```
+
+Dijkstra computes the shortest path within the mesh.
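
A minimal sketch of a shortest-path search over the router grid follows. It assumes unit edge weights between 4-neighbour routers and the 6×6 exclusion zone from D1; the real mesh may weight edges by physical distance, and `shortest_hops` is a hypothetical name.

```python
import heapq

HBM_ZONE = frozenset({(2, 2), (2, 3), (3, 2), (3, 3)})  # (row, col) cells


def shortest_hops(src, dst, rows=6, cols=6, blocked=HBM_ZONE):
    """Dijkstra over a 4-neighbour grid with unit edge weights.

    src and dst are (row, col). Returns the hop count, or None if unreachable.
    """
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            return d
        if d > dist[node]:
            continue  # stale heap entry
        r, c = node
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols) or nxt in blocked:
                continue
            if d + 1 < dist.get(nxt, float("inf")):
                dist[nxt] = d + 1
                heapq.heappush(heap, (d + 1, nxt))
    return None


print(shortest_hops((0, 0), (1, 4)))  # r0c0 -> r1c4
# 5
```

The result of 5 router-to-router hops matches the example path above. A route whose straight line would cross the exclusion zone, such as `r2c1` to `r2c4`, comes out longer than its Manhattan distance under these assumptions.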
+
+**Cross-cube HBM** (UCIe traversal):
+
+```text
+PE_DMA → r{x}c{y} → conn → ucie-{PORT} → [seam] → ucie-{PORT'} → conn
+       → r{x'}c{y'} → hbm_ctrl.pe{idx'}
+```
+
+**Kernel launch command to PE**:
+
+```text
+[from io_noc] → ucie → conn → r{x}c{y} → (mesh) → M_CPU → (mesh) → PE_CPU
+```
+
+**Shared SRAM access**:
+
+```text
+PE_DMA → r{x}c{y} → (mesh) → SRAM
+```
+
+### D8. HBM channel mapping mode
+
+Channel mapping is configured at cube scope:
+
+```yaml
+cube:
+  memory_map:
+    hbm_mapping_mode: n_to_one       # one_to_one | n_to_one
+    hbm_pseudo_channels: 64          # total pseudo-channel count
+    hbm_channels_per_pe: 8           # per-PE local channel count
+    hbm_channel_bw_gbs: 32.0         # per-channel bandwidth (GB/s)
+    hbm_slices_per_cube: 8           # number of per-PE partitions
+    hbm_total_gb_per_cube: 48
+```
+
+**n:1 mode (default, implemented).** Each PE's HBM partition is a single
+endpoint `hbm_ctrl.pe{idx}` that aggregates `hbm_channels_per_pe`
+pseudo-channels. The `Router ↔ hbm_ctrl.pe{idx}` link bandwidth equals
+`hbm_channels_per_pe × hbm_channel_bw_gbs`. Pseudo-channels are assumed to
+interleave; only aggregate per-PE BW is modeled. No separate aggregated
+router node exists — the per-PE router itself serves that role.
+
+**1:1 mode (future).** Each PE router decomposes into N channel
+mini-routers; per-channel routing carries fully-resolved PA + channel ID.
+A `ChannelSplitter` resolves a logical access to N per-channel physical
+requests. Each per-channel link models BW contention. Cross-PE channel
+access semantics are deferred to the implementation ADR.
+
+**BW math (defaults).**
+
+| Parameter                          | Value                      |
+| ---------------------------------- | -------------------------- |
+| pseudo channels per cube           | 64 (parameter)             |
+| PEs per cube                       | 8 (parameter)              |
+| channels per PE (N)                | 64 / 8 = 8                 |
+| per-channel BW                     | 32 GB/s (parameter)        |
+| per-PE local BW                    | N × 32 = 256 GB/s          |
+| cube total HBM BW                  | 64 × 32 = 2048 GB/s        |
+
+Both modes give the same per-PE effective BW; only the request shape and
+contention model differ.
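
The BW table can be reproduced from the `memory_map` defaults. The sketch below assumes the YAML has already been loaded into a dict; `derive_hbm_bw` and its consistency check are hypothetical and not part of the codebase.

```python
memory_map = {
    "hbm_pseudo_channels": 64,
    "hbm_channels_per_pe": 8,
    "hbm_channel_bw_gbs": 32.0,
    "hbm_slices_per_cube": 8,
}


def derive_hbm_bw(mm):
    """Return (channels_per_pe, per_pe_bw_gbs, cube_bw_gbs) for n:1 mode."""
    channels_per_pe = mm["hbm_pseudo_channels"] // mm["hbm_slices_per_cube"]
    if channels_per_pe != mm["hbm_channels_per_pe"]:
        # Assumed sanity check: configured N should match channels / slices.
        raise ValueError("hbm_channels_per_pe disagrees with channels / slices")
    per_pe_bw = channels_per_pe * mm["hbm_channel_bw_gbs"]
    cube_bw = mm["hbm_pseudo_channels"] * mm["hbm_channel_bw_gbs"]
    return channels_per_pe, per_pe_bw, cube_bw


print(derive_hbm_bw(memory_map))
# (8, 256.0, 2048.0)
```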
+
+### D9. AddressResolver — per-PE HBM endpoint
+
+The address resolver decodes a PA's HBM offset to the owning PE's
+partition:
+
+```python
+# policy/routing/router.py
+hbm_slice_bytes = hbm_total_gb_per_cube * (1 << 30) // hbm_slices_per_cube
+
+if addr.kind == "hbm":
+    pe_id = int(addr.hbm_offset) // hbm_slice_bytes
+    return f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
+```
+
+The `pe_id` computation is intrinsic to the routing layer (not a
+topology-time concern). Any HBM PA falls within exactly one partition,
+yielding deterministic routing.
+
+External callers (e.g., M_CPU DMA, Memory R/W from PCIE_EP) follow the
+same resolver path — there is no separate fast path.
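
For experimentation outside the codebase, the excerpt in D9 can be wrapped into a self-contained function. The `Addr` dataclass, the function name `resolve_hbm_endpoint`, and the range check are assumptions for this sketch; only the slice arithmetic and the endpoint name format come from the excerpt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Addr:
    """Hypothetical decoded physical address."""
    kind: str
    sip: int
    cube: int
    hbm_offset: int


def resolve_hbm_endpoint(addr, hbm_total_gb_per_cube=48, hbm_slices_per_cube=8):
    """Map an HBM address to the owning per-PE controller endpoint."""
    if addr.kind != "hbm":
        raise ValueError(f"not an HBM address: {addr.kind}")
    slice_bytes = hbm_total_gb_per_cube * (1 << 30) // hbm_slices_per_cube
    pe_id = int(addr.hbm_offset) // slice_bytes
    if not 0 <= pe_id < hbm_slices_per_cube:
        raise ValueError(f"HBM offset out of range: {addr.hbm_offset}")
    return f"sip{addr.sip}.cube{addr.cube}.hbm_ctrl.pe{pe_id}"


GIB = 1 << 30
print(resolve_hbm_endpoint(Addr("hbm", 0, 0, 0)))
print(resolve_hbm_endpoint(Addr("hbm", 0, 1, 6 * GIB)))
# sip0.cube0.hbm_ctrl.pe0
# sip0.cube1.hbm_ctrl.pe1
```

With the defaults from D8 each partition spans 6 GiB, so an offset of exactly 6 GiB is the first byte of PE1's partition.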
+
+### D10. Mesh generation parameters
+
+`mesh_gen.py` produces `cube_mesh.yaml` from:
+
+- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner.
+- `cube.geometry`: cube physical dimensions and HBM zone.
+- `cube.ucie.n_connections`: determines router count for UCIe attachment.
+
+Output `mesh_data` dictionary contains:
+
+- Router grid with positions and HBM exclusion zones.
+- PE-to-router attachments (`pe{idx}.dma`, `pe{idx}.cpu`, `pe{idx}.hbm`
+  per PE).
+- UCIe-to-router attachments (N/S/E/W distributed across edge routers).
+- M_CPU and SRAM router attachments.
+
+## Consequences
+
+- Local HBM (0 mesh hops, switching overhead only) and cross-PE HBM
+  (mesh hops) are naturally distinguishable, satisfying SPEC R5
+  (multi-domain communication) and ADR-0002 (no zero-latency end-to-end
+  paths).
+- All cube-internal traffic routes through one mesh — single contention
+  model, single layout, single set of edge BWs.
+- Per-PE HBM partitioning maps cleanly to the LA model (ADR-0011): each
+  PE's partition is the n:1 aggregate of its assigned pseudo-channels.
+- 1:1 mode extension is structurally natural — split each PE router into
+  N channel routers.
+- Mesh generation is fully parameterised by `topology.yaml`; PE/cube
+  geometry changes propagate without code edits.
+
+## Links
+
+- ADR-0002 (Routing distance, ordering, no zero-latency paths)
+- ADR-0003 D3 (cube-level NOC definition — extended here)
+- ADR-0004 (Memory semantics, local HBM)
+- ADR-0011 (Memory addressing — LA model consumes per-PE partition)
+- ADR-0014 D1 (PE_DMA egress via router mesh)
+- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
+- ADR-0016 (IOChiplet io_noc — analogous pattern at IO chiplet level)
+- ADR-0033 (Latency model: per-PC parallelism, switch penalty)