Add English translations for ADR-0018, 0019, 0020, 0021

- ADR-0018: LA-based memory address abstraction + BAAW + HBM channel mapping
- ADR-0019: CUBE NOC per-channel and aggregated HBM connection model
- ADR-0020: 2-pass data execution model (timing/data separation, greenlet)
- ADR-0021: PE pipeline refactor (component separation + token self-routing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 16:31:32 -07:00
parent 10b33b44ba
commit b2c52f0e34
4 changed files with 1962 additions and 0 deletions
@@ -0,0 +1,431 @@
+# ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC
+
+## Status
+
+Proposed
+
+## Context
+
+ADR-0018 introduced LA-based address abstraction and BAAW,
+defining how a logical memory access is translated into one of two request forms:
+
+- 1:1 mode: one logical access → N per-channel requests
+- n:1 mode: one logical access → one aggregated request
+
+Here N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`),
+determined by topology parameters.
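The fan-out of the two request forms can be sketched as below. This is a minimal illustration, not code from the repository: `MappingMode`, `channels_per_pe`, and `requests_per_access` are hypothetical names introduced here.

```python
from enum import Enum


class MappingMode(Enum):
    """Hypothetical enum for the two mapping modes described above."""
    ONE_TO_ONE = "one_to_one"
    N_TO_ONE = "n_to_one"


def channels_per_pe(hbm_pseudo_channels: int, pes_per_cube: int) -> int:
    """Return N = hbm_pseudo_channels / pes_per_cube, rejecting uneven splits."""
    if hbm_pseudo_channels <= 0 or pes_per_cube <= 0:
        raise ValueError("channel and PE counts must be positive")
    if hbm_pseudo_channels % pes_per_cube != 0:
        raise ValueError("hbm_pseudo_channels must be a multiple of pes_per_cube")
    return hbm_pseudo_channels // pes_per_cube


def requests_per_access(mode: MappingMode, n: int) -> int:
    """1:1 issues N per-channel requests; n:1 issues one aggregated request."""
    return n if mode is MappingMode.ONE_TO_ONE else 1
```

With 64 pseudo channels and 8 PEs per cube, N is 8, so one logical access becomes 8 requests in 1:1 mode and 1 request in n:1 mode.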
+
+### Problems with the Existing Structure
+
+In the current implementation (`topology/builder.py`):
+
+- The path PE_DMA → NOC → xbar_top/xbar_bot → HBM_CTRL.slice{0-7} is used
+- HBM is modeled as 8 slice nodes (one per PE)
+- Local and remote accesses use different paths:
+  - local: NOC → xbar → HBM slice
+  - cross-half: NOC → xbar_top → bridge → xbar_bot → HBM slice
+  - remote cube: NOC → UCIe → remote NOC → remote xbar → remote HBM slice
+
+Limitations of this structure:
+
+- Cannot model at pseudo-channel granularity (a slice is per-PE, not per-channel)
+- The xbar and bridge nodes split local and remote accesses onto different paths
+- Cannot express 1:1 / n:1 modes consistently
+
+---
+
+## Decision
+
+### D1. HBM Attaches to PE Routers
+
+Consolidate the current `hbm_ctrl.slice{0-7}` (8 nodes) into a **single `hbm_ctrl` node**,
+and attach the HBM access point to the same router where the PE is attached.
+
+- n:1 mode: a PE's local HBM access goes directly through its own router (switching overhead only, 0 hops)
+- A remote PE's HBM access reaches the target PE's router via mesh hops
+- The read/write resource model within the HBM controller is preserved
+
+Node naming changes:
+
+| Current | After Change |
+| ---- | ------- |
+| `sip0.cube0.hbm_ctrl.slice0` ~ `slice7` | `sip0.cube0.hbm_ctrl` (single) |
+
+In `mesh_gen.py`, add `pe{idx}.hbm` to the PE attachment so that
+the builder generates an edge between that router and hbm_ctrl.
+
+---
+
+### D2. Complete Removal of xbar, bridge, and Single NOC Node
+
+Remove all of the following nodes and related edges:
+
+- `{cube}.xbar_top`, `{cube}.xbar_bot`
+- `{cube}.bridge.left`, `{cube}.bridge.right`
+- `{cube}.noc` (single TwoDMeshNocComponent node)
+- Edges of type `noc_to_xbar`, `xbar_to_noc`, `xbar_to_hbm`, `hbm_to_xbar`
+- Edges of type `xbar_to_bridge`, `bridge_to_xbar`
+- Edges of type `pe_to_noc`, `noc_to_pe`, `noc_to_pe_cpu`, etc. referencing the single noc node
+
+Their role is replaced by an **explicit router mesh based on cube_mesh.yaml**.
+Each router (r0c0, r0c1, ...) from the 6x6 router grid generated by `mesh_gen.py`
+is created as a separate SimPy node in the topology graph,
+and adjacent routers are connected via XY mesh edges.
+
+---
+
+### D3. Explicit Router Mesh (Common Basis for n:1 / 1:1)
+
+#### Router Nodes Based on cube_mesh.yaml
+
+Each non-null router from cube_mesh.yaml generated by `mesh_gen.py`
+is created as a **separate SimPy node** in the topology graph.
+
+- Node ID: `{cube}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`)
+- kind: `noc_router`, impl: `forwarding_v1`
+- pos_mm: taken from cube_mesh.yaml
+
+Based on the attach information in cube_mesh.yaml, components are connected to each router:
+- `pe{p}.dma` → PE_DMA ↔ router edge
+- `pe{p}.cpu` → PE_CPU ↔ router edge
+- `pe{p}.hbm` → HBM_CTRL ↔ router edge (added in n:1)
+- `m_cpu` → M_CPU ↔ router edge
+- `sram` → SRAM ↔ router edge
+- `ucie_{dir}.c{i}` → UCIe conn ↔ router edge
+
+Router-to-router XY mesh edges: bidirectional edges between adjacent routers.
+Null routers (HBM exclusion zones) are skipped.
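The router-node and XY-edge generation described above could look roughly like this. It is a sketch under stated assumptions: the `Node` and `Edge` dataclasses are simplified stand-ins for the builder's own types, the `router_mesh` edge kind is invented for illustration, and `pos_mm` and component attachments are omitted. The grid is assumed to arrive as a list of rows in which a null router is `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Simplified stand-in for the builder's node type (assumed fields)."""
    id: str
    kind: str
    impl: str


@dataclass(frozen=True)
class Edge:
    """Simplified stand-in for the builder's edge type (assumed fields)."""
    src: str
    dst: str
    kind: str


def build_router_mesh(cube: str, grid: list[list[Optional[dict]]]):
    """Create one node per non-null router and XY edges between neighbours."""
    nodes: dict[str, Node] = {}
    edges: list[Edge] = []

    def rid(r: int, c: int) -> str:
        return f"{cube}.r{r}c{c}"

    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell is not None:
                nodes[rid(r, c)] = Node(rid(r, c), "noc_router", "forwarding_v1")

    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell is None:
                continue  # null router (HBM exclusion zone)
            # Visit only the east and south neighbours so each adjacent pair
            # is handled once, then emit both directions.
            for nr, nc in ((r, c + 1), (r + 1, c)):
                if nr < len(grid) and nc < len(grid[nr]) and grid[nr][nc] is not None:
                    edges.append(Edge(rid(r, c), rid(nr, nc), "router_mesh"))
                    edges.append(Edge(rid(nr, nc), rid(r, c), "router_mesh"))
    return nodes, edges
```

Visiting only the east and south neighbours avoids generating the same bidirectional pair twice.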
+
+#### 1:1 Mode Extension (To Be Implemented Later)
+
+In 1:1 mode, each router is split into N channel mini-routers.
+This requires per-channel routing and the introduction of a ChannelSplitter (LA → per-channel PA).
+N GEMM engines per PE are also added at this point.
+
+---
+
+### D4. Cross-PE HBM Access (n:1 Mode)
+
+In n:1 mode, when a PE accesses another PE's local HBM,
+it hops through the XY mesh in cube_mesh.yaml to reach the target PE's router.
+
+Example: PE0 (r0c0) accessing PE2's (r1c4) HBM:
+
+```text
+PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl
+```
+
+The Dijkstra router finds a shortest path in the mesh (5 router-to-router hops in this example).
+
+Cross-PE channel access in 1:1 mode will be defined during the 1:1 extension in D3.
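To illustrate the n:1 cross-PE routing above, here is a minimal Dijkstra sketch over a fully populated 6x6 grid. It assumes unit cost per hop and no null routers. `shortest_path` and `grid_adjacency` are hypothetical helpers, not the simulator's router implementation.

```python
import heapq


def shortest_path(adj: dict[str, list[str]], src: str, dst: str) -> list[str]:
    """Dijkstra with unit edge cost; returns the node sequence src..dst."""
    dist = {src: 0}
    prev: dict[str, str] = {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v in adj.get(u, []):
            nd = d + 1
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        raise ValueError(f"no route from {src} to {dst}")
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def grid_adjacency(rows: int, cols: int) -> dict[str, list[str]]:
    """Fully populated rows x cols mesh (no null routers), for illustration."""
    adj: dict[str, list[str]] = {}
    for r in range(rows):
        for c in range(cols):
            nbrs = [(r, c + 1), (r, c - 1), (r + 1, c), (r - 1, c)]
            adj[f"r{r}c{c}"] = [
                f"r{nr}c{nc}" for nr, nc in nbrs if 0 <= nr < rows and 0 <= nc < cols
            ]
    return adj


path = shortest_path(grid_adjacency(6, 6), "r0c0", "r1c4")
hops = len(path) - 1  # router-to-router hops
```

Several equally short paths exist between these two routers, so the exact sequence depends on tie-breaking. Only the hop count is fixed by the mesh geometry.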
+
+---
+
+### D5. n:1 Mode: Uses cube_mesh.yaml Router Mesh
+
+In n:1 mode, no separate "aggregated router" is created.
+The existing router grid from cube_mesh.yaml serves that role.
+
+#### Connection Structure
+
+PE_DMA, PE_CPU, and HBM are all connected to the router where each PE is attached:
+
+```text
+sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
+sip0.cube0.hbm_ctrl   ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
+```
+
+Routers are connected via XY mesh edges. PE's local HBM access goes
+directly from its own router (switching overhead only).
+
+#### n:1 Mode Full Data Paths
+
+**Local HBM (0 hops):**
+```text
+PE0.pe_dma → r0c0 → hbm_ctrl  (switching overhead only)
+```
+
+**Remote HBM (mesh hops):**
+```text
+PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl
+```
+
+**M_CPU DMA:**
+```text
+M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl
+```
+
+---
+
+### D6. All Traffic Is Unified onto the Same Router Mesh
+
+- All memory accesses (DMA data) and commands (PE_CPU) use the same router mesh
+- Local access does not use a separate fast path (xbar)
+- Cross-cube (remote) access path:
+
+```text
+PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT}
+  → [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl
+```
+
+UCIe connections maintain the existing structure,
+but both endpoints become mesh routers instead of xbars.
+
+The number of UCIe lines is determined by the bandwidth ratio: `ucie_lines_per_side = ceil(ucie_bw / noc_line_bw)`.
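The line-count formula can be written directly in Python. The function name mirrors the formula, and the argument names and GB/s unit are assumptions for illustration.

```python
import math


def ucie_lines_per_side(ucie_bw_gbs: float, noc_line_bw_gbs: float) -> int:
    """Number of NOC lines needed to carry the UCIe bandwidth on one side."""
    if ucie_bw_gbs <= 0 or noc_line_bw_gbs <= 0:
        raise ValueError("bandwidths must be positive")
    return math.ceil(ucie_bw_gbs / noc_line_bw_gbs)
```

Rounding up means a UCIe bandwidth that is not an exact multiple of the line bandwidth still gets enough lines, at the cost of some unused capacity on the last one.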
+
+---
+
+### D7. AddressResolver Changes
+
+Current `AddressResolver.resolve()`:
+
+```python
+# Current: HBM offset → pe_slice → "sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
+pe_slice = PhysAddr.hbm_pe_id(addr.hbm_offset, self._slice_size_bytes)
+return f"sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
+```
+
+After change:
+
+```python
+# Changed: HBM → single endpoint
+return f"sip{s}.cube{c}.hbm_ctrl"
+```
+
+The `pe_slice` calculation is removed.
+In n:1 mode, PE_DMA directly accesses the `hbm_ctrl` attached to its own router.
+
+`resolver.resolve()` is retained for external access (M_CPU DMA, etc.) and backward compatibility.
+
+---
+
+### D8. topology.yaml Configuration Changes
+
+#### Added Settings
+
+```yaml
+cube:
+  memory_map:
+    hbm_mapping_mode: n_to_one          # one_to_one | n_to_one
+    hbm_pseudo_channels: 64             # total pseudo channel count
+    hbm_channels_per_pe: 8              # local channels per PE (= pseudo_channels / pes_per_cube)
+    hbm_channel_bw_gbs: 32.0            # per-channel bandwidth (GB/s)
+    hbm_total_gb_per_cube: 48           # retained
+```
+
+#### Removed Settings
+
+```yaml
+# To be removed
+links:
+  xbar_to_hbm_bw_gbs: 256.0            # → replaced by channel_bw_gbs × channels_per_pe
+  xbar_to_hbm_mm: 2.5                  # → replaced by ch_router_to_hbm_mm
+  xbar_to_bridge_bw_gbs: 128.0         # → removed (no bridge)
+  xbar_to_bridge_mm: 3.0               # → removed
+  noc_to_xbar_bw_gbs: ...              # → removed
+  noc_to_xbar_mm: ...                  # → removed
+```
+
+#### Added Link Settings
+
+```yaml
+links:
+  router_link_bw_gbs: 256.0            # XY mesh link BW between routers
+  router_overhead_ns: 2.0              # router switching overhead
+  pe_to_router_bw_gbs: 256.0           # PE_DMA ↔ router
+  hbm_to_router_bw_gbs: 256.0          # HBM ↔ router (= N × channel_bw)
+```
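A loader could cross-check the settings above along the following lines. This is a sketch only: `validate_hbm_config` is a hypothetical helper, and `pes_per_cube` is passed in separately because its location in topology.yaml is not shown in this ADR.

```python
def validate_hbm_config(memory_map: dict, links: dict, pes_per_cube: int) -> int:
    """Check the HBM settings for internal consistency and return N."""
    mode = memory_map["hbm_mapping_mode"]
    if mode not in ("one_to_one", "n_to_one"):
        raise ValueError(f"unknown hbm_mapping_mode: {mode!r}")

    total = memory_map["hbm_pseudo_channels"]
    n = memory_map["hbm_channels_per_pe"]
    if n * pes_per_cube != total:
        raise ValueError(
            "hbm_channels_per_pe must equal hbm_pseudo_channels / pes_per_cube"
        )

    # The HBM <-> router link is documented as N x channel bandwidth.
    expected_bw = n * memory_map["hbm_channel_bw_gbs"]
    if abs(links["hbm_to_router_bw_gbs"] - expected_bw) > 1e-9:
        raise ValueError(
            f"hbm_to_router_bw_gbs should be N x channel_bw = {expected_bw}"
        )
    return n
```

Failing at load time keeps a mismatch between the redundant `hbm_channels_per_pe` and `hbm_to_router_bw_gbs` values from silently skewing simulated bandwidth.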
+
+---
+
+### D9. Bandwidth Numerical Consistency
+
+| Configuration | Value |
+| ---- | --- |
+| pseudo channels per cube | 64 (parameter) |
+| PEs per cube | 8 (parameter) |
+| channels per PE (N) | `pseudo_channels / pes_per_cube` = 8 |
+| per-channel BW | 32 GB/s (parameter) |
+| per-PE local BW | N × 32 = 256 GB/s |
+| cube total HBM BW | 64 × 32 = 2048 GB/s |
+
+The effective BW per PE is identical in both modes:
+
+- 1:1 mode: N channel links × channel_bw_gbs = N × 32 = 256 GB/s
+- n:1 mode: 1 aggregated link = N × channel_bw_gbs = 256 GB/s
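The equality of per-PE bandwidth across modes can be checked with a few lines of arithmetic. `pe_hbm_links` is a hypothetical helper, and the constants are the parameter values from the table above.

```python
def pe_hbm_links(mode: str, n: int, channel_bw_gbs: float) -> list[float]:
    """Bandwidth (GB/s) of each PE-to-HBM link in the given mode."""
    if mode == "one_to_one":
        return [channel_bw_gbs] * n      # N per-channel links
    if mode == "n_to_one":
        return [n * channel_bw_gbs]      # one aggregated link
    raise ValueError(f"unknown mode: {mode!r}")


N, CHANNEL_BW, PES = 8, 32.0, 8
per_pe_1to1 = sum(pe_hbm_links("one_to_one", N, CHANNEL_BW))
per_pe_nto1 = sum(pe_hbm_links("n_to_one", N, CHANNEL_BW))
cube_total = PES * per_pe_nto1
```

Both modes sum to 256 GB/s per PE, and 8 PEs give the 2048 GB/s cube total.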
+
+---
+
+## Consequences
+
+### Positive
+
+- The router mesh based on cube_mesh.yaml accurately reflects physical placement
+- In n:1 mode, the existing VA scheme is preserved, keeping transition costs low
+- Local, remote, and command traffic is unified onto the same mesh, which simplifies the model
+- Aligns well with graph compiler-based topology generation
+- Channel count and PE count are both parameterized, enabling testing of various configurations
+- The 1:1 mode extension follows naturally from splitting each router into N channel mini-routers
+
+### Negative
+
+- The number of SimPy nodes increases due to explicit router nodes (6x6 grid with null routers skipped, up to 32 routers/cube)
+- Requires a complete rewrite of existing xbar/bridge/single NOC-based tests
+- The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model
+
+---
+
+## Alternatives
+
+### A1. Retain Existing xbar + HBM Slices
+
+- Local/remote paths remain bifurcated
+- Cannot model at pseudo-channel granularity
+- Cannot switch between 1:1/n:1 modes
+
+### A2. Always Generate Per-Channel Links and Aggregate Only in n:1
+
+- The topology always has the full 1:1 size, even in n:1 mode
+- Expressing n:1 semantics via link aggregation is complex
+- No reduction in router node count
+
+### A3. Gradual Transition (Retain xbar + Add NOC Path)
+
+- Higher compatibility, but dual-path coexistence increases complexity
+- Since xbar removal is ultimately necessary, the intermediate step provides little value
+
+---
+
+## Implementation Notes
+
+### topology/builder.py Change Details
+
+#### Code to Remove (within current `_instantiate_cube()`)
+
+- xbar_top, xbar_bot node creation (~line 495-508)
+- bridge.left, bridge.right node creation
+- noc ↔ xbar edge creation (~line 540-555)
+- xbar ↔ hbm_ctrl.slice edge creation (~line 510-538)
+- xbar ↔ bridge edge creation (~line 557-572)
+
+#### Code to Add
+
+1:1 mode:
+
+```python
+# Edge(...) calls show only src, dst, bw_gbs, and kind; other fields are elided.
+N = hbm_channels_per_pe  # from topology config
+total_ch = hbm_pseudo_channels
+
+# Create channel router nodes
+for ch_id in range(total_ch):
+    pe_id = ch_id // N  # PE that owns this channel
+    nodes[f"{cp}.ch_r{ch_id}"] = Node(
+        id=f"{cp}.ch_r{ch_id}", kind="noc_router", impl="noc_v1",
+        attrs={}, pos_mm=(...),  # horizontal row = ch_id % N
+    )
+
+# PE_DMA ↔ local channel router edges
+for pe_id in range(pes_per_cube):
+    for local_ch in range(N):
+        ch_id = pe_id * N + local_ch
+        edges.append(Edge(
+            src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.ch_r{ch_id}",
+            bw_gbs=channel_bw, kind="pe_to_ch_router"))
+        edges.append(Edge(
+            src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.pe{pe_id}.pe_dma",
+            bw_gbs=channel_bw, kind="ch_router_to_pe"))
+
+# Channel router ↔ hbm_ctrl edges
+for ch_id in range(total_ch):
+    edges.append(Edge(
+        src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.hbm_ctrl",
+        bw_gbs=channel_bw, kind="ch_router_to_hbm"))
+    edges.append(Edge(
+        src=f"{cp}.hbm_ctrl", dst=f"{cp}.ch_r{ch_id}",
+        bw_gbs=channel_bw, kind="hbm_to_ch_router"))
+
+# Horizontal line edges (same logical index)
+for row in range(N):
+    for p in range(pes_per_cube - 1):
+        ch_a = p * N + row
+        ch_b = (p + 1) * N + row
+        edges.append(Edge(
+            src=f"{cp}.ch_r{ch_a}", dst=f"{cp}.ch_r{ch_b}",
+            bw_gbs=ch_horizontal_bw, kind="ch_horizontal"))
+        edges.append(Edge(
+            src=f"{cp}.ch_r{ch_b}", dst=f"{cp}.ch_r{ch_a}",
+            bw_gbs=ch_horizontal_bw, kind="ch_horizontal"))
+```
+
+n:1 mode:
+
+```python
+# Edge(...) calls show only src, dst, bw_gbs, and kind; other fields are elided.
+# Create aggregated router nodes
+for pe_id in range(pes_per_cube):
+    nodes[f"{cp}.pe{pe_id}.agg_router"] = Node(
+        id=f"{cp}.pe{pe_id}.agg_router", kind="noc_router", impl="noc_v1",
+        attrs={}, pos_mm=(...),
+    )
+
+agg_bw = N * channel_bw  # aggregated BW
+
+# PE_DMA ↔ aggregated router
+for pe_id in range(pes_per_cube):
+    edges.append(Edge(
+        src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.pe{pe_id}.agg_router",
+        bw_gbs=agg_bw, kind="pe_to_agg_router"))
+    edges.append(Edge(
+        src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.pe{pe_id}.pe_dma",
+        bw_gbs=agg_bw, kind="agg_router_to_pe"))
+
+# Aggregated router ↔ hbm_ctrl
+for pe_id in range(pes_per_cube):
+    edges.append(Edge(
+        src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.hbm_ctrl",
+        bw_gbs=agg_bw, kind="agg_to_hbm"))
+    edges.append(Edge(
+        src=f"{cp}.hbm_ctrl", dst=f"{cp}.pe{pe_id}.agg_router",
+        bw_gbs=agg_bw, kind="hbm_to_agg"))
+
+# Horizontal links between aggregated routers
+for p in range(pes_per_cube - 1):
+    edges.append(Edge(
+        src=f"{cp}.pe{p}.agg_router", dst=f"{cp}.pe{p+1}.agg_router",
+        bw_gbs=agg_horizontal_bw, kind="agg_horizontal"))
+    edges.append(Edge(
+        src=f"{cp}.pe{p+1}.agg_router", dst=f"{cp}.pe{p}.agg_router",
+        bw_gbs=agg_horizontal_bw, kind="agg_horizontal"))
+```
+
+### Affected Existing Tests
+
+| Test File | Impact |
+| ---------- | ---- |
+| `tests/test_topology_compile.py` | Remove xbar/bridge node references, add channel router verification |
+| `tests/test_topology_load.py` | Reflect topology.yaml configuration changes |
+| `tests/test_pe_components.py` | PE_DMA routing path changes |
+| `tests/test_sip_parallel.py` | Cross-PE access path changes |
+| Cases that directly test xbar/bridge | Remove |
+
+---
+
+## Test Requirements
+
+- Verify that requests are delivered via per-channel links in 1:1 mode
+- Verify that requests are delivered via the aggregated link in n:1 mode
+- Verify that topology is correctly generated in both modes:
+  - 1:1: `total_ch` channel routers + per-PE links + horizontal links
+  - n:1: `pes_per_cube` aggregated routers + per-PE links
+- Verify that effective BW is consistent across both modes for the same workload
+- Verify that horizontal line routing works for cross-PE access
+- Verify that routing through UCIe works for cross-cube access
+- Verify that topology generation is correct under parameter variations (channels_per_pe = 4, 8, 16, etc.)
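For the parameter-variation checks, a test could compare generated topologies against closed-form counts derived from the builder loops in the Implementation Notes. This is a sketch with a hypothetical helper, `expected_counts`, which counts only the channel or aggregated routers and the directed edges those loops create. It ignores the cube_mesh.yaml mesh routers and all other components.

```python
def expected_counts(mode: str, pes_per_cube: int, channels_per_pe: int) -> tuple[int, int]:
    """Return (router_nodes, directed_edges) implied by the builder loops."""
    p, n = pes_per_cube, channels_per_pe
    if mode == "one_to_one":
        routers = p * n
        edges = 2 * routers          # PE_DMA <-> channel router
        edges += 2 * routers         # channel router <-> hbm_ctrl
        edges += 2 * n * (p - 1)     # horizontal links per logical index
        return routers, edges
    if mode == "n_to_one":
        routers = p
        edges = 2 * p                # PE_DMA <-> aggregated router
        edges += 2 * p               # aggregated router <-> hbm_ctrl
        edges += 2 * (p - 1)         # horizontal links
        return routers, edges
    raise ValueError(f"unknown mode: {mode!r}")
```

A parametrized test can then loop over `channels_per_pe` values such as 4, 8, and 16 and assert that the generated node and edge counts match.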
+
+---
+
+## Links
+
+- ADR-0018 (LA + BAAW) → addressing-side integration
+- ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
+- ADR-0004 (Memory Semantics) → BW model redefinition
+- ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes