Add CHANGES.md, README, update SPEC/ADRs for release 2

- CHANGES.md: detailed changelog for release 1 and 2
- README.md: full project docs with install, probe, run, test usage
- SPEC.md: add ADR-0014~0017 references, update R7 for pcie_ep endpoint
- ADR-0003: update NOC description to reference ADR-0017
- ADR-0004: add HBM efficiency factor (0.8) to BW guarantee contract
- ADR-0014: status Proposed -> Accepted
- ADR-0015: update D4 to M_CPU bypass for Memory R/W, add ADR-0016/0017 links
- ADR-0016 (new): IOChiplet NOC and memory data path
- ADR-0017 (new): Cube NOC 2D mesh architecture
- Fix MD lint warnings (unfenced code blocks) across all docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 01:43:15 -07:00
parent d75da439c6
commit fc6abbc8ee
10 changed files with 613 additions and 65 deletions
@@ -0,0 +1,189 @@
+# ADR-0017: Cube NOC 2D Mesh Architecture
+
+## Status
+
+Accepted
+
+## Context
+
+ADR-0003 D3 defines the cube-level NOC as a "distributed on-die fabric" but
+does not specify the internal routing model, contention semantics, or
+attachment topology. The implementation uses a 2D mesh router grid with
+XY routing and per-segment contention modeling. This ADR formalizes that
+architecture.
+
+## Decision
+
+### D1. NOC node and router grid
+
+Each cube contains a single NOC topology node (`sip{S}.cube{C}.noc`)
+implemented as `noc_2d_mesh_v1`. Internally, the NOC models a 2D router
+grid generated by `mesh_gen.py`.
+
+Grid properties:
+
+- Default dimensions: 6x6 routers (derived from PE layout + UCIe connections)
+- Router naming: `r{row}c{col}` (e.g., `r0c0`, `r5c5`)
+- HBM exclusion zone: center rows/columns are excluded where HBM physically
+  occupies space (e.g., r2c2, r2c3, r3c2, r3c3)
+- Router positions are derived from physical PE corner placement and cube
+  geometry
+
+The NOC node's `overhead_ns` is 0.0. Latency is instead modeled as the
+Manhattan distance traversed within the mesh (`distance_mm` x `ns_per_mm`).
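
The grid properties and latency rule in D1 can be sketched as follows. This is a minimal illustration and not the output of `mesh_gen.py`: the uniform 1.0 mm router pitch, the `NS_PER_MM` value, and the helper names are assumptions made for the example, and the HBM zone is the example zone quoted above.

```python
# Hypothetical constants: real values would come from the cube geometry config.
ROWS, COLS = 6, 6
PITCH_MM = 1.0   # assumed uniform router spacing
NS_PER_MM = 0.1  # assumed wire delay per millimetre
HBM_ZONE = {(2, 2), (2, 3), (3, 2), (3, 3)}  # example exclusion zone from D1


def router_names():
    """Return r{row}c{col} names for every router outside the HBM zone."""
    return [
        f"r{r}c{c}"
        for r in range(ROWS)
        for c in range(COLS)
        if (r, c) not in HBM_ZONE
    ]


def traversal_ns(src, dst):
    """Uncontended latency: Manhattan distance in mm times ns_per_mm."""
    (r0, c0), (r1, c1) = src, dst
    distance_mm = (abs(r0 - r1) + abs(c0 - c1)) * PITCH_MM
    return distance_mm * NS_PER_MM


names = router_names()
print(len(names), names[0], names[-1])
print(traversal_ns((0, 0), (5, 5)))
```

With these assumed values, a corner-to-corner traversal covers 10 mm, so the latency scales directly with the chosen `NS_PER_MM`.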
+
+### D2. XY routing algorithm
+
+The NOC uses deterministic XY routing:
+
+1. Horizontal segment: travel along the source row (source Y) from source X
+   to destination X
+2. Vertical segment: travel along the destination column (destination X)
+   from source Y to destination Y
+
+Each directed segment is identified by a unique link key:
+
+- Horizontal: `("H", y_band, x_min, x_max, direction)`
+- Vertical: `("V", x_band, y_min, y_max, direction)`
+
+Grid positions are snapped to the router grid, excluding the HBM zone.
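
A minimal sketch of how the two link keys in D2 could be derived for one route. The function name `xy_segments` and the `"+"`/`"-"` direction encoding are assumptions for illustration. The sketch also assumes both endpoints are already snapped to the router grid and does not handle the HBM zone.

```python
def xy_segments(src, dst):
    """Return the directed link keys for an XY route from src to dst.

    src and dst are (x, y) router coordinates already snapped to the grid.
    The "+"/"-" direction encoding is an assumption of this sketch.
    """
    (sx, sy), (dx, dy) = src, dst
    segments = []
    if sx != dx:
        # Horizontal first: travel along the source row (y band = sy).
        direction = "+" if dx > sx else "-"
        segments.append(("H", sy, min(sx, dx), max(sx, dx), direction))
    if sy != dy:
        # Then vertical: travel along the destination column (x band = dx).
        direction = "+" if dy > sy else "-"
        segments.append(("V", dx, min(sy, dy), max(sy, dy), direction))
    return segments


print(xy_segments((0, 0), (5, 4)))
print(xy_segments((5, 4), (0, 0)))
```

The forward and reverse routes produce different keys, both because the direction flag flips and because the horizontal leg runs in a different row. Opposite-direction traffic therefore never maps to the same key.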
+
+### D3. Contention model
+
+Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
+sharing a segment (same row or column band, same direction) contend for the
+resource. This models link-level serialization in a wormhole-routed mesh.
+
+With no contention, NOC traversal latency equals the Manhattan distance
+multiplied by `ns_per_mm`. Under contention, additional queueing delay
+is added by SimPy's resource scheduling.
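
The serialization effect in D3 can be approximated without SimPy. The sketch below replaces `simpy.Resource(capacity=1)` with plain "busy until" bookkeeping per segment. It is not the simulator's code: the `simulate` function, the transaction tuple shape, and the fixed per-segment hold time are assumptions for illustration.

```python
def simulate(transactions):
    """Serialize transactions over shared directed segments.

    transactions: list of (start_ns, hold_ns, segment_keys). They are
    processed in start-time order, and each segment serves one transaction
    at a time, approximating a capacity-1 resource with FIFO queueing.
    Returns finish times in that processing order.
    """
    busy_until = {}  # segment key -> time at which the segment becomes free
    finish_times = []
    for start_ns, hold_ns, segments in sorted(transactions, key=lambda t: t[0]):
        now = start_ns
        for key in segments:
            # Queue behind any earlier transaction still holding this segment.
            now = max(now, busy_until.get(key, 0.0))
            now += hold_ns
            busy_until[key] = now
        finish_times.append(now)
    return finish_times


row_east = ("H", 0, 0, 5, "+")
row_west = ("H", 0, 0, 5, "-")
col_south = ("V", 5, 0, 4, "+")
print(simulate([
    (0.0, 2.0, [row_east, col_south]),  # uncontended: 2 segments x 2 ns
    (0.0, 2.0, [row_east]),             # waits 2 ns for row_east
    (0.0, 2.0, [row_west]),             # opposite direction, no contention
]))
```

The second transaction finishes at 4 ns rather than 2 ns because it shares `row_east` with the first. The third uses the same row in the opposite direction and is not delayed.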
+
+### D4. NOC attachment points
+
+The NOC connects to all major cube-level components:
+
+```text
+                    UCIe-N (conn x4)
+                         |
+           +---------+---+---+---------+
+           |         |       |         |
+PE0.dma ---+  r0c0   |  ...  |  r0c5  +--- PE2.dma
+PE0.cpu <--+         |       |         +--< PE2.cpu
+           |         |       |         |
+UCIe-W ----+  ...    | [HBM] |  ...   +---- UCIe-E
+(conn x4)  |         | zone  |         |  (conn x4)
+           |  r2c0   |       |         |
+M_CPU <--->+         |       |         |
+           |  r3c0   |       |         |
+SRAM <---->+         |       |         |
+           |         |       |         |
+PE4.dma ---+  r4c0   |  ...  |  r4c5  +--- PE6.dma
+PE4.cpu <--+         |       |         +--< PE6.cpu
+           |         |       |         |
+           +---------+---+---+---------+
+                         |
+                    UCIe-S (conn x4)
+
+xbar_top attached to: r0c0, r0c1, r1c4, r1c5 (top-half PE routers)
+xbar_bot attached to: r4c0, r4c1, r5c4, r5c5 (bottom-half PE routers)
+```
+
+### D5. NOC edge bandwidths and distances
+
+| Connection | BW (GB/s) | Distance | Notes |
+| --- | --- | --- | --- |
+| PE_DMA -> NOC | 256.0 | Physical (PE pos) | Matches HBM slice BW |
+| NOC -> PE_CPU | - | 0.0 mm | Command path only |
+| NOC <-> xbar_top | 256.0 | 0.0 mm | Per xbar half |
+| NOC <-> xbar_bot | 256.0 | 0.0 mm | Per xbar half |
+| NOC <-> M_CPU | - | 0.0 mm | Command path |
+| NOC <-> SRAM | 128.0 x4 | 0.0 mm | 512 GB/s aggregate |
+| NOC <-> UCIe conn | 128.0 | 0.0 mm | Per connection, 4 per port |
+
+Distance 0.0 mm for most connections reflects the distributed nature of
+the NOC; the actual traversal distance is computed internally via Manhattan
+distance within the router grid.
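
As a worked example of the D5 table, the sketch below converts the listed bandwidths into per-transfer serialization time. It assumes 1 GB/s = 10^9 bytes/s (so 1 GB/s equals 1 byte/ns) and reads "128.0 x4" as four parallel 128 GB/s links. The dictionary keys and function name are hypothetical, and whether the simulator charges bandwidth exactly this way is not stated in this ADR.

```python
# Bandwidths copied from the D5 table, in GB/s (assumed 1 GB/s == 1 byte/ns).
EDGE_BW_GBPS = {
    "pe_dma->noc": 256.0,
    "noc<->xbar_top": 256.0,
    "noc<->xbar_bot": 256.0,
    "noc<->sram_link": 128.0,   # one of four links (assumed reading of "x4")
    "noc<->ucie_conn": 128.0,   # per connection, four per port
}
SRAM_LINKS = 4


def serialization_ns(n_bytes, bw_gbps):
    """Time to push n_bytes through a link of the given bandwidth."""
    if bw_gbps <= 0:
        raise ValueError("bandwidth must be positive")
    return n_bytes / bw_gbps


print(serialization_ns(4096, EDGE_BW_GBPS["pe_dma->noc"]))      # 16.0
print(serialization_ns(4096, EDGE_BW_GBPS["noc<->ucie_conn"]))  # 32.0
print(SRAM_LINKS * EDGE_BW_GBPS["noc<->sram_link"])             # 512.0
```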
+
+### D6. UCIe decomposition and inter-cube traffic
+
+Each cube has 4 UCIe ports (N, S, E, W). Each port is decomposed into:
+
+- 1 `ucie-{PORT}` node: UCIe protocol endpoint (overhead = 8.0 ns)
+- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe
+
+This decomposition enables N=4 independent NOC-to-UCIe connections per port,
+each with 128 GB/s bandwidth. Total aggregate per port: 512 GB/s.
+
+Inter-cube traffic path:
+
+```text
+Source: PE_DMA -> NOC -> conn{i} -> ucie-{PORT}
+                    [UCIe link: 512 GB/s, 1.0mm seam distance]
+Target: ucie-{PORT} -> conn{i} -> NOC -> xbar -> HBM
+```
+
+UCIe overhead (8.0 ns) is applied at each ucie-{PORT} node, so a
+full crossing incurs 16 ns (TX port + RX port).
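
The port decomposition and the overhead arithmetic in D6 can be checked with a short sketch. The node names follow the `ucie-{PORT}` and `ucie-{PORT}.conn{0-3}` patterns above; the constant and function names are assumptions for illustration.

```python
PORTS = ("N", "S", "E", "W")
CONNS_PER_PORT = 4
CONN_BW_GBPS = 128.0
UCIE_OVERHEAD_NS = 8.0


def ucie_nodes(port):
    """Return the endpoint node and its connection-bridge nodes for a port."""
    if port not in PORTS:
        raise ValueError(f"unknown UCIe port: {port!r}")
    endpoint = f"ucie-{port}"
    conns = [f"{endpoint}.conn{i}" for i in range(CONNS_PER_PORT)]
    return endpoint, conns


def port_aggregate_gbps():
    """Aggregate NOC-to-UCIe bandwidth of one port."""
    return CONNS_PER_PORT * CONN_BW_GBPS


def crossing_overhead_ns():
    """Overhead of one inter-cube crossing: TX port plus RX port."""
    return 2 * UCIE_OVERHEAD_NS


print(ucie_nodes("E"))
print(port_aggregate_gbps())   # 512.0
print(crossing_overhead_ns())  # 16.0
```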
+
+### D7. Data paths through the NOC
+
+**PE DMA to local HBM (same half):**
+
+```text
+PE_DMA -> NOC -> xbar_top -> HBM_CTRL.slice{0-3}
+```
+
+**PE DMA to cross-half HBM:**
+
+```text
+PE_DMA -> NOC -> xbar_top -> bridge -> xbar_bot -> HBM_CTRL.slice{4-7}
+```
+
+**PE DMA to remote cube HBM:**
+
+```text
+PE_DMA -> NOC -> conn -> ucie-E -> [seam] -> ucie-W -> conn -> NOC -> xbar -> HBM
+```
+
+**Kernel Launch command to PE:**
+
+```text
+[from io_noc] -> ucie -> conn -> NOC -> M_CPU -> NOC -> PE_CPU
+```
+
+**Shared SRAM access:**
+
+```text
+PE_DMA -> NOC -> SRAM
+```
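
The two local HBM paths in D7 differ only in whether the bridge is crossed. A minimal sketch of that selection follows. The function name `hbm_path` is hypothetical, and mirroring the rule for bottom-half PEs (entering via `xbar_bot`) is an assumption, since the examples above show only the top-half case.

```python
def hbm_path(pe_half, hbm_slice):
    """Hop list for a PE DMA access to an HBM slice in the same cube.

    pe_half: "top" or "bot". Slices 0-3 sit behind xbar_top and slices 4-7
    behind xbar_bot, as in the paths above. The bottom-half mirror case is
    an assumption of this sketch.
    """
    if pe_half not in ("top", "bot"):
        raise ValueError(f"unknown half: {pe_half!r}")
    if not 0 <= hbm_slice <= 7:
        raise ValueError(f"slice out of range: {hbm_slice}")
    slice_half = "top" if hbm_slice < 4 else "bot"
    path = ["PE_DMA", "NOC", f"xbar_{pe_half}"]
    if slice_half != pe_half:
        # Cross-half access goes through the bridge to the other xbar.
        path += ["bridge", f"xbar_{slice_half}"]
    path.append(f"HBM_CTRL.slice{hbm_slice}")
    return path


print(" -> ".join(hbm_path("top", 2)))
print(" -> ".join(hbm_path("top", 5)))
```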
+
+### D8. Mesh generation
+
+The router grid is generated by `mesh_gen.py` based on:
+
+- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner
+- `cube.geometry`: cube physical dimensions and HBM zone
+- `cube.ucie.n_connections`: determines router count for UCIe attachment
+
+The generator produces a `mesh_data` dictionary containing:
+
+- Router grid with positions and HBM exclusion zones
+- PE-to-router attachments (pe_dma, pe_cpu per PE)
+- UCIe-to-router attachments (N/S/E/W, distributed across edge routers)
+- M_CPU and SRAM router attachments
+- xbar_top/bot router assignments (top-half vs bottom-half PE routers)
+
+## Consequences
+
+- NOC provides position-aware routing with deterministic paths; latency is
+  deterministic in the absence of contention
+- Contention is captured per directed segment (not per-node)
+- All cube-internal traffic is explicitly routed through the NOC
+- HBM exclusion zone reflects physical die layout constraints
+- The mesh generation is fully parameterized by `topology.yaml`
+
+## Links
+
+- ADR-0003 D3 (cube-level NOC definition — extended by this ADR)
+- ADR-0004 D1 (PE DMA to local HBM path via xbar)
+- ADR-0004 D3 (cross-half HBM via bridge)
+- ADR-0014 D1 (PE_DMA dual egress: xbar for HBM, NOC for non-HBM)
+- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
+- ADR-0016 D1 (IOChiplet io_noc — analogous pattern at IO chiplet level)