Replace xbar/bridge/single-NOC with explicit router mesh (ADR-0019)

- Remove xbar_top/bot, bridge, single noc node from topology
- Each cube_mesh.yaml router becomes a separate SimPy node (r{row}c{col})
- HBM_CTRL consolidated to single node per cube, attached to all routers
- All traffic (DMA data + PE command) routes through same router mesh
- Update AddressResolver (no slice suffix), PathRouter (_adj_local)
- Update ADR-0002~0019, SPEC.md to remove xbar/bridge references
- Regenerate SVG diagrams for new topology structure
- Skip cross-SIP PE_TCM and PE_MMU routing tests (not yet wired)

326 passed, 13 skipped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 17:51:28 -07:00
parent 31c7110da7
commit 5917b3497c
35 changed files with 953 additions and 1326 deletions
@@ -34,12 +34,11 @@ shortcuts that obscure control paths.
  (topology + policy + request).

 ### D3. Bypass is explicit and graph-represented
- Any bypass (e.g., local cube HBM access via XBAR instead of NOC) must be:
-  - explicitly represented as a graph path, and
-  - subject to latency accumulation like any other path.
- Example: PE_DMA has dual egress — one to XBAR (HBM path) and one to NOC (non-HBM path).
-  Both are explicit graph edges; neither is a “bypass” — they are distinct data paths
-  serving different memory domains.
+- All paths must be explicitly represented in the graph and subject to latency accumulation.
+- Example: PE_DMA connects to the NOC router mesh (ADR-0019). All destinations
+  (HBM, shared SRAM, inter-cube UCIe) are reached via explicit mesh hops.
+  Local HBM access has minimal hops (switching overhead only); remote access
+  traverses additional routers.
 - Implicit or “magic” bypass paths are disallowed.

 ### D4. No zero-latency end-to-end paths
@@ -35,12 +35,11 @@ We model the system hierarchy explicitly:

 - A CUBE contains:
  - HBM + memory controller (HBM_CTRL)
-  - XBAR (top/bottom): HBM pseudo-channel crossbar, PE's dedicated path to HBM
-  - Bridge (left/right): connects XBAR.top ↔ XBAR.bottom for cross-half HBM access
-  - NOC: 2D mesh router grid spanning the entire cube with XY routing and
-    per-segment contention modeling; carries all intra-cube traffic including
-    PE DMA to xbar (HBM), inter-cube (UCIe), command (M_CPU↔PE_CPU), and
-    shared SRAM access. See ADR-0017 for full NOC architecture.
+  - NOC router mesh: 2D grid of explicit routers (from cube_mesh.yaml) with XY routing;
+    carries all intra-cube traffic including HBM data, inter-cube (UCIe),
+    command (M_CPU↔PE_CPU), and shared SRAM access.
+    HBM_CTRL is attached to PE routers (local HBM = 0 hop).
+    See ADR-0017 and ADR-0019 for full architecture.
  - Shared SRAM: cube-level shared memory accessible by all PEs via NOC
  - management/control CPU (M_CPU) coordinating PE command distribution and completion aggregation
  - multiple PEs
@@ -14,9 +14,9 @@ Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth,
 ### D1. Local HBM definition

 - Each PE is assigned a logically defined “local HBM” region.
- Local HBM corresponds to the pseudo-channel subset directly attached to that PE’s DMA path
-  via the XBAR (top or bottom, depending on PE corner placement).
- The path is: PE_DMA → XBAR.top/bottom → HBM_CTRL.
+- Local HBM corresponds to the pseudo-channel subset directly attached to that PE’s
+  router in the NOC mesh (ADR-0019).
+- The path is: PE_DMA → local router → HBM_CTRL (switching overhead only, 0 mesh hops).
 - The mapping (HBM pseudo-channels → PE local regions) is derived from topology configuration.

 ### D2. Local HBM bandwidth guarantee contract
@@ -27,19 +27,18 @@ Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth,
  The efficiency factor (configured via `hbm_ctrl.attrs.efficiency`, default 0.8)
  models real-world DRAM inefficiencies (refresh cycles, bank conflicts, page
  misses). For example: 256 GB/s spec x 0.8 = 204.8 GB/s effective.
- The topology builder applies the efficiency factor to xbar-to-hbm edge
+- The topology builder applies the efficiency factor to router-to-hbm edge
  bandwidth at graph construction time, so all downstream routing and latency
  computation uses the effective value.
 - This guarantee is modeled by:
  - a dedicated logical path and/or service model that enforces HBM BW at the PE-local-HBM interaction point,
  - while still incurring non-zero latency along explicitly modeled components.
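
The efficiency arithmetic in D2 can be sketched as below. This is only an illustration of the stated formula (spec bandwidth x efficiency); the function name and the range check are assumptions, not the topology builder's actual code.

```python
def effective_hbm_bw_gbs(spec_bw_gbs: float, efficiency: float = 0.8) -> float:
    """Return the effective router-to-HBM edge bandwidth in GB/s.

    Hypothetical helper: mirrors `hbm_ctrl.attrs.efficiency` (default 0.8)
    being applied once, at graph construction time.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return spec_bw_gbs * efficiency


# 256 GB/s spec at the default efficiency gives the 204.8 GB/s quoted above.
print(effective_hbm_bw_gbs(256.0))
```

Applying the factor once to the edge, rather than at every latency computation, keeps routing and latency code unaware of the efficiency setting.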

-### D3. Cross-half HBM semantics
+### D3. Remote PE HBM semantics (intra-cube)

- A PE connected to XBAR.bottom that accesses HBM pseudo-channels on the XBAR.top half
-  (or vice versa) traverses a bridge:
-  - PE_DMA → XBAR.bottom → bridge → XBAR.top → HBM_CTRL
- Bridge bandwidth may limit cross-half HBM access relative to local-half access.
+- A PE that accesses another PE's local HBM traverses the router mesh:
+  - PE_DMA → local router → (mesh hops) → target PE's router → HBM_CTRL
+- Router mesh bandwidth and hop count may limit remote HBM access relative to local access.

 ### D4. Non-local HBM semantics (inter-cube / inter-SIP)

@@ -61,7 +60,7 @@ Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth,
 Tests should cover:

 - local-HBM case: BW matches HBM BW regardless of fabric BW parameter
- cross-half HBM case: latency includes bridge traversal
+- remote PE HBM case: latency includes mesh hop traversal
 - non-local cases (inter-cube/inter-SIP): BW/latency respond to fabric/link parameters
 - shared SRAM case: access via NOC with correct BW

@@ -82,9 +82,8 @@ Explain cube-internal structure and data/control flow.

 **Visible elements**

- XBAR (top/bottom): HBM pseudo-channel crossbar
- Bridge (left/right): cross-half HBM connectors between XBAR.top and XBAR.bottom
- NOC: distributed on-die fabric for non-HBM traffic
+- Router mesh: 2D grid of NOC routers (from cube_mesh.yaml), all traffic routes through mesh
+- HBM_CTRL attached to PE routers (local HBM = 0 hop)
 - HBM subsystem (HBM_CTRL)
 - Shared SRAM: cube-level shared memory
 - Management CPU (M_CPU)
@@ -97,14 +96,13 @@ Explain cube-internal structure and data/control flow.

 **Visible links**

- PE → XBAR (HBM data path, top or bottom by corner placement)
- PE → NOC (non-HBM data path)
- XBAR ↔ bridge ↔ XBAR (cross-half HBM access)
- XBAR → HBM_CTRL
- NOC ↔ UCIe endpoints
- NOC ↔ shared SRAM
- M_CPU ↔ NOC (command path)
- NOC → PE_CPU (command delivery, collapsed into PE block)
+- PE → router (HBM + non-HBM data path via mesh)
+- Router ↔ HBM_CTRL (local HBM access)
+- Router ↔ Router (mesh hops for remote access)
+- Router ↔ UCIe endpoints
+- Router ↔ shared SRAM
+- M_CPU ↔ router (command path)
+- Router → PE_CPU (command delivery, collapsed into PE block)

 ---

@@ -61,9 +61,9 @@ For each view (SIP / CUBE / PE):
  - preserve connectivity semantics relevant to that view,
  - compute distance buckets and assign layout layers deterministically.
 - CUBE-level projection MUST include:
-  - XBAR (top/bottom), bridge (left/right), NOC, HBM_CTRL, shared SRAM, M_CPU, UCIe ports,
+  - Router mesh (from cube_mesh.yaml), HBM_CTRL, shared SRAM, M_CPU, UCIe ports,
    and PEs as opaque blocks.
-  - Distinct edge kinds for HBM path (PE→XBAR) vs non-HBM path (PE→NOC).
+  - All paths (HBM, non-HBM, command) route through the same router mesh (ADR-0019).
 - Default anchors are implicit (ADR-0005) and MUST NOT require instance indices.

 ### D6. Output formats and determinism
@@ -44,14 +44,15 @@ Each PE contains the following logical components.
 **PE_DMA**

 - Handles memory transfers between PE_TCM and external memory domains.
- PE_DMA has **dual egress** at the CUBE level:
-  - **→ XBAR**: dedicated path to HBM (local and cross-half via bridge)
-  - **→ NOC**: path to non-HBM destinations (shared SRAM, inter-cube UCIe, etc.)
+- PE_DMA connects to the NOC router mesh at the CUBE level (ADR-0019):
+  - All destinations (HBM, shared SRAM, inter-cube UCIe) are reached via the router mesh
+  - Local HBM access: PE_DMA → local router → hbm_ctrl (switching overhead only)
+  - Remote/shared: PE_DMA → local router → (mesh hops) → destination
 - Supported directions include:
-  - HBM → PE_TCM (via XBAR)
-  - PE_TCM → HBM (via XBAR)
-  - PE_TCM → shared SRAM (via NOC)
-  - PE_TCM → other memory domains (via NOC, if supported by topology)
+  - HBM → PE_TCM (via router mesh)
+  - PE_TCM → HBM (via router mesh)
+  - PE_TCM → shared SRAM (via router mesh)
+  - PE_TCM → other memory domains (via router mesh, if supported by topology)

 **PE_GEMM**

@@ -251,7 +252,7 @@ Compute operations use a TCM-centric dataflow model.
 **Input path (HBM)**

 ```text
-HBM → XBAR → PE_DMA (DMA_READ) → PE_TCM
+HBM → router mesh → PE_DMA (DMA_READ) → PE_TCM
 ```

 **Input path (shared SRAM)**
@@ -268,14 +269,14 @@ Compute engines read input tensors from PE_TCM.
 PE_TCM → GEMM / MATH
 ```

-Weights for GEMM may optionally stream directly from HBM (via XBAR).
+Weights for GEMM may optionally stream directly from HBM (via router mesh).

 **Output path (HBM)**

 Compute results are written to PE_TCM, then DMA writes to HBM.

 ```text
-PE_TCM → PE_DMA (DMA_WRITE) → XBAR → HBM
+PE_TCM → PE_DMA (DMA_WRITE) → router mesh → HBM
 ```

 **Output path (shared SRAM)**
@@ -347,9 +348,9 @@ PE instances are derived from `cube.pe_layout`.

 External connectivity such as:

- PE_DMA → XBAR (HBM data path)
- PE_DMA → NOC (non-HBM data path: shared SRAM, inter-cube UCIe)
- NOC → PE_CPU (command path from M_CPU)
+- PE_DMA → router mesh → HBM (data path, ADR-0019)
+- PE_DMA → router mesh → shared SRAM, inter-cube UCIe (non-HBM data path)
+- router mesh → PE_CPU (command path from M_CPU)

 is modeled at the CUBE level (see ADR-0003 D3).

@@ -104,13 +104,13 @@ Kernel Launch routes through M_CPU for PE fan-out.
 ```text
 pcie_ep → io_noc → io_ucie
  → [transit cubes: ucie_in → noc → ucie_out]  (zero or more)
-  → target cube: ucie_in → noc → xbar → hbm_ctrl
+  → target cube: ucie_in → router mesh → hbm_ctrl
 ```

 **Memory R/W completion path:**

 ```text
-hbm_ctrl → xbar → noc → [transit cubes: ucie → noc → ucie]
+hbm_ctrl → router mesh → [transit cubes: ucie → router mesh → ucie]
  → io_ucie → io_noc → pcie_ep
 ```

@@ -49,7 +49,7 @@ Memory operations (MemoryWrite, MemoryRead) are routed directly from pcie_ep
 through io_noc to the target cube, bypassing io_cpu entirely:

 ```text
-pcie_ep → io_noc → conn → io_ucie → [cube UCIe] → noc → xbar → hbm_ctrl
+pcie_ep → io_noc → conn → io_ucie → [cube UCIe] → router mesh → hbm_ctrl
 ```

 This avoids the 10ns io_cpu overhead for pure data transfers. The simulation
@@ -16,9 +16,10 @@ architecture.

 ### D1. NOC node and router grid

-Each cube contains a single NOC topology node (`sip{S}.cube{C}.noc`)
-implemented as `noc_2d_mesh_v1`. Internally, the NOC models a 2D router
-grid generated by `mesh_gen.py`.
+Each cube contains a 2D router mesh generated by `mesh_gen.py`.
+Each router is a separate topology node (`sip{S}.cube{C}.r{row}c{col}`)
+implemented as `forwarding_v1`. (Supersedes the original single-node
+`noc_2d_mesh_v1` design — see ADR-0019.)

 Grid properties:

@@ -82,8 +83,8 @@ PE4.cpu <--+         |       |         +--< PE6.cpu
                         |
                    UCIe-S (conn x4)

-xbar_top attached to: r0c0, r0c1, r1c4, r1c5 (top-half PE routers)
-xbar_bot attached to: r4c0, r4c1, r5c4, r5c5 (bottom-half PE routers)
+HBM attach: hbm_ctrl is also connected to each router that has a PE (ADR-0019 D1)
+(xbar_top/xbar_bot removed by ADR-0019)
 ```

 ### D5. NOC edge bandwidths and distances
@@ -92,8 +93,7 @@ xbar_bot attached to: r4c0, r4c1, r5c4, r5c5 (bottom-half PE routers)
 | --- | --- | --- | --- |
 | PE_DMA -> NOC | 256.0 | Physical (PE pos) | Matches HBM slice BW |
 | NOC -> PE_CPU | - | 0.0 mm | Command path only |
-| NOC <-> xbar_top | 256.0 | 0.0 mm | Per xbar half |
-| NOC <-> xbar_bot | 256.0 | 0.0 mm | Per xbar half |
+| Router <-> HBM_CTRL | 256.0 | 0.0 mm | Per PE router (ADR-0019) |
 | NOC <-> M_CPU | - | 0.0 mm | Command path |
 | NOC <-> SRAM | 128.0 x4 | 0.0 mm | 512 GB/s aggregate |
 | NOC <-> UCIe conn | 128.0 | 0.0 mm | Per connection, 4 per port |
@@ -117,7 +117,7 @@ Inter-cube traffic path:
 ```text
 Source: PE_DMA -> NOC -> conn{i} -> ucie-{PORT}
                    [UCIe link: 512 GB/s, 1.0mm seam distance]
-Target: ucie-{PORT} -> conn{i} -> NOC -> xbar -> HBM
+Target: ucie-{PORT} -> conn{i} -> r{x}c{y} -> (mesh hops) -> hbm_ctrl
 ```

 UCIe overhead (8.0 ns) is applied at each ucie-{PORT} node, so a
@@ -128,31 +128,31 @@ full crossing incurs 16 ns (TX port + RX port).
 **PE DMA to local HBM (same half):**

 ```text
-PE_DMA -> NOC -> xbar_top -> HBM_CTRL.slice{0-3}
+PE_DMA -> r{x}c{y} -> hbm_ctrl  (local: 0 mesh hops, switching overhead only)
 ```

-**PE DMA to cross-half HBM:**
+**PE DMA to remote PE's HBM:**

 ```text
-PE_DMA -> NOC -> xbar_top -> bridge -> xbar_bot -> HBM_CTRL.slice{4-7}
+PE_DMA -> r{x}c{y} -> (mesh hops) -> r{x'}c{y'} -> hbm_ctrl
 ```

 **PE DMA to remote cube HBM:**

 ```text
-PE_DMA -> NOC -> conn -> ucie-E -> [seam] -> ucie-W -> conn -> NOC -> xbar -> HBM
+PE_DMA -> r{x}c{y} -> conn -> ucie-E -> [seam] -> ucie-W -> conn -> r{x'}c{y'} -> hbm_ctrl
 ```

 **Kernel Launch command to PE:**

 ```text
-[from io_noc] -> ucie -> conn -> NOC -> M_CPU -> NOC -> PE_CPU
+[from io_noc] -> ucie -> conn -> r{x}c{y} -> (mesh hops) -> M_CPU -> (mesh hops) -> PE_CPU
 ```

 **Shared SRAM access:**

 ```text
-PE_DMA -> NOC -> SRAM
+PE_DMA -> r{x}c{y} -> (mesh hops) -> SRAM
 ```

 ### D8. Mesh generation
@@ -169,7 +169,7 @@ The generator produces a `mesh_data` dictionary containing:
 - PE-to-router attachments (pe_dma, pe_cpu per PE)
 - UCIe-to-router attachments (N/S/E/W, distributed across edge routers)
 - M_CPU and SRAM router attachments
- xbar_top/bot router assignments (top-half vs bottom-half PE routers)
+- HBM attachment per PE router (ADR-0019)

 ## Consequences

@@ -182,8 +182,8 @@ The generator produces a `mesh_data` dictionary containing:
 ## Links

 - ADR-0003 D3 (cube-level NOC definition — extended by this ADR)
- ADR-0004 D1 (PE DMA to local HBM path via xbar)
- ADR-0004 D3 (cross-half HBM via bridge)
- ADR-0014 D1 (PE_DMA dual egress: xbar for HBM, NOC for non-HBM)
+- ADR-0004 D1 (PE DMA to local HBM path via router mesh)
+- ADR-0014 D1 (PE_DMA egress via router mesh)
+- ADR-0019 (NOC-Local HBM: xbar/bridge removal, explicit router mesh)
 - ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
 - ADR-0016 D1 (IOChiplet io_noc — analogous pattern at IO chiplet level)
@@ -247,7 +247,7 @@ simulator의 routing 및 resource 모델에서 직접 사용 가능한 request
 DmaReadCmd.src_addr (VA)
  → MMU.translate(VA) → PA
  → PhysAddr.decode(PA) → PhysAddr object
-  → resolver.resolve(PhysAddr) → dst_node_id (e.g., "sip0.cube0.hbm_ctrl.slice3")
+  → resolver.resolve(PhysAddr) → dst_node_id (e.g., "sip0.cube0.hbm_ctrl")
  → router.find_path(pe_prefix, dst_node_id) → path
  → create 1 sub-Transaction → fabric inject
 ```
@@ -36,16 +36,14 @@ topology 파라미터로 결정된다.

 ## Decision

-### D1. The HBM controller is defined as a single endpoint per CUBE
+### D1. HBM is attached to PE routers

-Consolidate the current `hbm_ctrl.slice{0-7}` (8 nodes) into a **single `hbm_ctrl` node**.
+Consolidate the current `hbm_ctrl.slice{0-7}` (8 nodes) into a **single `hbm_ctrl` node**,
+and attach an HBM access point to each router that a PE is attached to.

- A pseudo channel is represented not as the HBM controller node itself
-  but as the **unit of the links** connected to the controller
- The read/write resource model inside the HBM controller is kept,
-  but the contention unit depends on the mode:
-  - 1:1 mode: the per-channel link is the BW contention point (the controller is terminal)
-  - n:1 mode: the aggregated link is the BW contention point (the controller is terminal)
+- n:1 mode: a PE reaches its local HBM directly from its own router (switching overhead only, 0 hops)
+- Remote PE HBM access: reaches the target PE's router via mesh hops
+- The read/write resource model inside the HBM controller is kept

 Node naming change:

@@ -53,198 +51,127 @@ topology 파라미터로 결정된다.
 | ---- | ------- |
 | `sip0.cube0.hbm_ctrl.slice0` ~ `slice7` | `sip0.cube0.hbm_ctrl` (single) |

+In `mesh_gen.py`, `pe{idx}.hbm` is added to the PE attachments so that
+the builder creates an edge between that router and hbm_ctrl.
+
 ---

-### D2. Remove xbar and bridge entirely
+### D2. Remove xbar, bridge, and the single NOC node entirely

 Remove all of the following existing nodes and their related edges:

 - `{cube}.xbar_top`, `{cube}.xbar_bot`
 - `{cube}.bridge.left`, `{cube}.bridge.right`
+- `{cube}.noc` (the single TwoDMeshNocComponent node)
 - edges of kind `noc_to_xbar`, `xbar_to_noc`, `xbar_to_hbm`, `hbm_to_xbar`
 - edges of kind `xbar_to_bridge`, `bridge_to_xbar`
+- edges that reference the single noc node, such as `pe_to_noc`, `noc_to_pe`, `noc_to_pe_cpu`

-Their roles (PE→HBM routing, cross-half connection) are taken over by
-channel routers and horizontal line connections (see D3, D4).
+Their role is taken over by an **explicit router mesh based on cube_mesh.yaml**.
+Each router (r0c0, r0c1, ...) of the 6×6 router grid generated by the existing `mesh_gen.py`
+is created as a separate SimPy node in the topology graph,
+and adjacent routers are connected by XY mesh edges.

 ---

-### D3. 1:1 mode: per-channel router based connections
+### D3. Explicit router mesh (common basis for n:1 / 1:1)

-#### Channel router definition
+#### Router nodes based on cube_mesh.yaml

-In 1:1 mode, the graph compiler creates as many **channel router** nodes as there are
-pseudo-channels. Channel routers are part of the NOC.
+Each non-null router in the cube_mesh.yaml generated by `mesh_gen.py` is created
+as a **separate SimPy node** in the topology graph.

-```text
-Example parameters: hbm_pseudo_channels=64, pes_per_cube=8
-→ channels_per_pe = 8, 64 channel routers created in total
-```
+- Node ID: `{cube}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`)
+- kind: `noc_router`, impl: `forwarding_v1`
+- pos_mm: taken from cube_mesh.yaml

-Node naming: `{cube}.ch_r{global_channel_id}`
+Components are connected to each router according to the attach information in the existing cube_mesh.yaml:
+- `pe{p}.dma` → PE_DMA ↔ router edge
+- `pe{p}.cpu` → PE_CPU ↔ router edge
+- `pe{p}.hbm` → HBM_CTRL ↔ router edge (added for n:1)
+- `m_cpu` → M_CPU ↔ router edge
+- `sram` → SRAM ↔ router edge
+- `ucie_{dir}.c{i}` → UCIe conn ↔ router edge

-| PE | Owned channel routers |
-| -- | -------------------- |
-| PE0 | ch_r0, ch_r1, ..., ch_r7 |
-| PE1 | ch_r8, ch_r9, ..., ch_r15 |
-| ... | ... |
-| PE7 | ch_r56, ch_r57, ..., ch_r63 |
+XY mesh edges between routers: bidirectional edges between adjacent routers.
+Null routers (HBM exclusion zone) are skipped.
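
The node and edge construction described in D3 can be sketched as follows. This is a minimal illustration under stated assumptions: the grid is passed in as a list of booleans (`False` marks a null router), and `build_mesh` is a hypothetical name. The real builder reads cube_mesh.yaml and also attaches PE, HBM, SRAM, M_CPU, and UCIe components, which this sketch omits.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]


def build_mesh(cube: str, present: List[List[bool]]) -> Tuple[Set[str], Set[Edge]]:
    """Create one node per non-null router plus edges in both directions
    between horizontally or vertically adjacent routers."""
    nodes: Set[str] = set()
    edges: Set[Edge] = set()
    rows = len(present)
    for r in range(rows):
        for c in range(len(present[r])):
            if not present[r][c]:
                continue  # null router (e.g., HBM exclusion zone): no node
            node = f"{cube}.r{r}c{c}"
            nodes.add(node)
            # Look only right and down; the reverse edge is added explicitly.
            for nr, nc in ((r, c + 1), (r + 1, c)):
                if nr < rows and nc < len(present[nr]) and present[nr][nc]:
                    nbr = f"{cube}.r{nr}c{nc}"
                    edges.add((node, nbr))
                    edges.add((nbr, node))
    return nodes, edges


# Toy 3x3 grid with a null router in the centre (not the real 6x6 layout).
toy = [[True, True, True], [True, False, True], [True, True, True]]
nodes, edges = build_mesh("sip0.cube0", toy)
print(len(nodes), len(edges))
```

Skipping a null router removes both the node and every edge that would touch it, so a path search over the result can never route through the exclusion zone.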

-In general: PE `p` owns channels `p * channels_per_pe` through `(p+1) * channels_per_pe - 1`.
+#### 1:1 mode extension (to be implemented later)

-#### PE_DMA ↔ channel router connections
-
-Each PE_DMA is connected to its N local channel routers by bidirectional links:
-
-```text
-sip0.cube0.pe0.pe_dma ←→ sip0.cube0.ch_r0  (bw: channel_bw_gbs)
-sip0.cube0.pe0.pe_dma ←→ sip0.cube0.ch_r1  (bw: channel_bw_gbs)
-...
-sip0.cube0.pe0.pe_dma ←→ sip0.cube0.ch_r7  (bw: channel_bw_gbs)
-```
-
- edge kind: `pe_to_ch_router` / `ch_router_to_pe`
- BW: `hbm_channel_bw_gbs` (e.g., 32 GB/s)
- distance: physical distance from the PE to the channel router (layout based)
-
-#### Channel router ↔ HBM controller connections
-
-Each channel router is connected to the cube's hbm_ctrl by a bidirectional link:
-
-```text
-sip0.cube0.ch_r0 ←→ sip0.cube0.hbm_ctrl  (bw: channel_bw_gbs)
-sip0.cube0.ch_r1 ←→ sip0.cube0.hbm_ctrl  (bw: channel_bw_gbs)
-...
-sip0.cube0.ch_r63 ←→ sip0.cube0.hbm_ctrl (bw: channel_bw_gbs)
-```
-
- edge kind: `ch_router_to_hbm` / `hbm_to_ch_router`
- BW: `hbm_channel_bw_gbs` (e.g., 32 GB/s)
-
-#### 1:1 mode full data path
-
-```text
-PE0.pe_dma
-  ├→ ch_r0 → hbm_ctrl  (32 GB/s)
-  ├→ ch_r1 → hbm_ctrl  (32 GB/s)
-  ├→ ...
-  └→ ch_r7 → hbm_ctrl  (32 GB/s)
-                         total PE0 local BW = N × channel_bw_gbs
-```
+In 1:1 mode, each router is subdivided into N channel mini-routers.
+This requires per-channel routing and a ChannelSplitter (LA → per-channel PA).
+N GEMM engines per PE are also added at that point.

 ---

-### D4. 1:1 mode: horizontal line connections (cross-PE channel access)
+### D4. Cross-PE HBM access (n:1 mode)

-#### Placement rule
+In n:1 mode, when a PE accesses another PE's local HBM,
+it hops through the XY mesh of cube_mesh.yaml to the target PE's router.

-Channel routers with the same **logical index** are placed on the same horizontal row.
-
-Logical index definition: `logical_idx = global_channel_id % channels_per_pe`
+Example: PE0 (r0c0) accesses the HBM of PE2 (r1c4):

 ```text
-Example parameters: channels_per_pe=8, pes_per_cube=8
-
-Row 0: ch_r0  (PE0) ↔ ch_r8  (PE1) ↔ ch_r16 (PE2) ↔ ... ↔ ch_r56 (PE7)
-Row 1: ch_r1  (PE0) ↔ ch_r9  (PE1) ↔ ch_r17 (PE2) ↔ ... ↔ ch_r57 (PE7)
-Row 2: ch_r2  (PE0) ↔ ch_r10 (PE1) ↔ ch_r18 (PE2) ↔ ... ↔ ch_r58 (PE7)
-...
-Row 7: ch_r7  (PE0) ↔ ch_r15 (PE1) ↔ ch_r23 (PE2) ↔ ... ↔ ch_r63 (PE7)
+PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl
 ```

-In general: Row `r` holds `{ch_r(p * N + r) | p ∈ 0..pes_per_cube-1}`,
-where `N = channels_per_pe`.
+The Dijkstra router finds the shortest path through the mesh.
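
The shortest-path search over the mesh can be sketched as follows. This is an illustration, not the project's PathRouter: it assumes every mesh hop has cost 1, takes null routers as a set of (row, col) coordinates, and uses a hypothetical function name. With equal-cost ties, the exact route may differ from the example path above, but the hop count is the same.

```python
import heapq
from typing import Dict, List, Set, Tuple

Coord = Tuple[int, int]


def shortest_path(rows: int, cols: int, nulls: Set[Coord],
                  src: Coord, dst: Coord) -> List[str]:
    """Dijkstra over a rows x cols router grid, skipping null routers."""
    dist: Dict[Coord, int] = {src: 0}
    prev: Dict[Coord, Coord] = {}
    heap = [(0, src)]
    while heap:
        d, cur = heapq.heappop(heap)
        if cur == dst:
            break
        if d > dist[cur]:
            continue  # stale heap entry
        r, c = cur
        for nxt in ((r, c + 1), (r, c - 1), (r + 1, c), (r - 1, c)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols) or nxt in nulls:
                continue
            nd = d + 1  # assumed uniform hop cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = cur
                heapq.heappush(heap, (nd, nxt))
    if dst not in dist:
        raise ValueError("destination router is unreachable")
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return [f"r{r}c{c}" for r, c in reversed(path)]


# PE0's router (r0c0) to PE2's router (r1c4) on a 6x6 grid without nulls.
path = shortest_path(6, 6, set(), (0, 0), (1, 4))
print(" → ".join(path), f"({len(path) - 1} mesh hops)")
```

In the real graph the edge weights would presumably come from link distance and router overhead rather than a constant, which can change which of several equal-hop routes is chosen.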

-#### horizontal line edge
-
-Adjacent channel routers on the same row are connected by bidirectional edges:
-
-```text
-ch_r0 ↔ ch_r8 ↔ ch_r16 ↔ ... ↔ ch_r56
-```
-
- edge kind: `ch_horizontal`
- BW: `hbm_channel_bw_gbs` (or configurable inter-PE channel BW)
- distance: physical distance between PEs
-
-#### Cross-PE HBM access path (1:1 mode)
-
-When PE0 accesses PE1's local channel (ch_r8):
-
-```text
-PE0.pe_dma → ch_r0 → ch_r8 (horizontal hop) → hbm_ctrl
-```
-
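-The Dijkstra router finds the shortest path along the horizontal line.
-
-#### Design intent
-
-This placement rule:
-
- simplifies routing rules: horizontal = cross-PE, vertical = PE-local
- simplifies distance calculation: hops within a row = |src_pe - dst_pe|
- ensures structural regularity: every row has the same structure
+Cross-PE channel access in 1:1 mode will be defined with the 1:1 extension in D3.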

 ---

-### D5. n:1 mode: aggregated router based connections
+### D5. n:1 mode: use the cube_mesh.yaml router mesh

-#### Aggregated router definition
-
-In n:1 mode, the graph compiler creates one **aggregated router** node per PE.
-Aggregated routers are part of the NOC.
-
-Node naming: `{cube}.pe{p}.agg_router`
+In n:1 mode, no separate "aggregated router" is created.
+The router grid of the existing cube_mesh.yaml fills that role.

 #### Connection structure

-```text
-sip0.cube0.pe0.pe_dma ←→ sip0.cube0.pe0.agg_router  (bw: N × channel_bw_gbs)
-sip0.cube0.pe0.agg_router ←→ sip0.cube0.hbm_ctrl    (bw: N × channel_bw_gbs)
-```
-
- edge kind: `pe_to_agg_router` / `agg_router_to_pe`, `agg_to_hbm` / `hbm_to_agg`
- BW: `channels_per_pe × hbm_channel_bw_gbs` (e.g., 8 × 32 = 256 GB/s)
-
-#### Cross-PE access (n:1 mode)
-
-When PE0 accesses PE1's local HBM:
+PE_DMA, PE_CPU, and HBM are all connected to the router each PE is attached to:

 ```text
-PE0.pe_dma → PE0.agg_router → PE1.agg_router → hbm_ctrl
+sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
+sip0.cube0.hbm_ctrl   ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
 ```

-Connections between aggregated routers:
-
-```text
-pe0.agg_router ↔ pe1.agg_router ↔ pe2.agg_router ↔ ... ↔ pe7.agg_router
-```
-
- edge kind: `agg_horizontal`
- BW: configurable (inter-PE aggregated BW)
+Routers are connected by XY mesh edges. A PE reaches its local HBM
+directly from its own router (switching overhead only).

 #### n:1 mode full data path

+**local HBM (0 hop):**
 ```text
-PE0.pe_dma → PE0.agg_router → hbm_ctrl
-             (BW = N × channel_bw_gbs = 256 GB/s)
+PE0.pe_dma → r0c0 → hbm_ctrl  (switching overhead only)
+```
+
+**remote HBM (mesh hops):**
+```text
+PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl
+```
+
+**M_CPU DMA:**
+```text
+M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl
 ```

 ---

-### D6. Unify local / remote access over the NOC
+### D6. Unify all traffic over the same router mesh

- All memory access is delivered through the NOC (channel router or aggregated router)
+- All memory access (DMA data) and commands (PE_CPU) use the same router mesh
 - Local access does not use a separate fast path (xbar) either
 - Cross-cube (remote) access path:

 ```text
-1:1 mode: PE_DMA → ch_r{local} → ch_r{...} → UCIe → remote_ch_r → remote_hbm_ctrl
-n:1 mode: PE_DMA → agg_router → UCIe → remote_agg_router → remote_hbm_ctrl
+PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT}
+  → [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl
 ```

 UCIe connections keep the existing structure,
-but both endpoints become channel routers or aggregated routers instead of xbar.
+but both endpoints become mesh routers instead of xbar.
+
+The number of UCIe lines is determined by the BW ratio: `ucie_lines_per_side = ceil(ucie_bw / noc_line_bw)`.
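
The line-count formula can be sketched as below. The function name and the input validation are assumptions. The sample inputs reuse the 512 GB/s UCIe link and the 128 GB/s per-connection figure from the ADR-0017 table; whether those are the values this formula is meant to receive is also an assumption.

```python
import math


def ucie_lines_per_side(ucie_bw_gbs: float, noc_line_bw_gbs: float) -> int:
    """Smallest number of NOC lines whose combined BW covers the UCIe BW."""
    if ucie_bw_gbs <= 0 or noc_line_bw_gbs <= 0:
        raise ValueError("bandwidths must be positive")
    return math.ceil(ucie_bw_gbs / noc_line_bw_gbs)


print(ucie_lines_per_side(512.0, 128.0))  # exact ratio
print(ucie_lines_per_side(512.0, 200.0))  # non-integer ratio rounds up
```

Rounding up rather than down means the NOC side is never the bottleneck for a UCIe port, at the cost of some unused line bandwidth when the ratio is not an integer.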

 ---

@@ -266,9 +193,7 @@ return f"sip{s}.cube{c}.hbm_ctrl"
 ```

 The pe_slice calculation is removed.
-Because BAAW already determines dst_node, in 1:1 mode PE_DMA bypasses the
-resolver and BAAW returns the channel router node_id directly.
-In n:1 mode, BAAW likewise returns the aggregated router node_id.
+In n:1 mode, PE_DMA directly accesses the hbm_ctrl attached to its own router.

 resolver.resolve() is kept for external access (M_CPU DMA, etc.) and backward compatibility.
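
The effect of dropping the slice suffix can be sketched as follows. The `PhysAddr` fields and the function name here are hypothetical stand-ins, not the project's real classes; only the returned node ID format comes from this ADR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysAddr:
    """Hypothetical decoded physical address; field names are assumptions."""
    sip: int
    cube: int
    offset: int


def resolve_hbm_node(pa: PhysAddr) -> str:
    """Map an HBM address to the single per-cube controller node.

    No pe_slice is derived from the offset any more, so every HBM address
    in the same cube resolves to the same node ID.
    """
    return f"sip{pa.sip}.cube{pa.cube}.hbm_ctrl"


print(resolve_hbm_node(PhysAddr(sip=0, cube=0, offset=0x1000)))
print(resolve_hbm_node(PhysAddr(sip=0, cube=0, offset=0x7FFF_0000)))
```

Which router the request enters hbm_ctrl through is then decided by path search over the mesh, not by the resolver.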

@@ -305,16 +230,10 @@ links:

 ```yaml
 links:
-  pe_to_ch_router_bw_gbs: 32.0         # PE_DMA ↔ channel router
-  pe_to_ch_router_mm: 1.0              # physical distance
-  ch_router_to_hbm_bw_gbs: 32.0        # channel router ↔ hbm_ctrl
-  ch_router_to_hbm_mm: 2.0             # physical distance
-  ch_horizontal_bw_gbs: 32.0           # horizontal link between channel routers
-  ch_horizontal_mm: 1.5                # horizontal distance between PEs
-  # for n:1 mode
-  pe_to_agg_router_bw_gbs: 256.0       # PE_DMA ↔ aggregated router
-  agg_to_hbm_bw_gbs: 256.0             # aggregated router ↔ hbm_ctrl
-  agg_horizontal_bw_gbs: 256.0         # link between aggregated routers
+  router_link_bw_gbs: 256.0            # XY mesh link BW between routers
+  router_overhead_ns: 2.0              # router switching overhead
+  pe_to_router_bw_gbs: 256.0           # PE_DMA ↔ router
+  hbm_to_router_bw_gbs: 256.0          # HBM ↔ router (= N × channel_bw)
 ```

 ---
@@ -341,19 +260,18 @@ links:

 ### Positive

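- Per-pseudo-channel BW contention modeling is natural in 1:1 mode
- The aggregated bandwidth model is simple in n:1 mode
- Local / remote access paths are unified over the NOC
+- The cube_mesh.yaml based router mesh accurately reflects the physical layout
+- n:1 mode keeps the existing VA scheme, so the migration cost is low
+- Local / remote / command traffic is unified over the same mesh, which keeps the model simple
 - Fits well with graph compiler based topology generation
 - Channel count and PE count are both parameters, so various configurations can be tested
+- The 1:1 mode extension follows naturally from router subdivision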

 ### Negative

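- The number of routers and links increases significantly in 1:1 mode
-  (64 channel routers + 64 edges to HBM + 56 horizontal edges per cube)
- Local access also uses the NOC path, so the model becomes more generalized
- Existing xbar based tests need a full rewrite
- Simulation performance may be affected by the increased number of SimPy nodes
+- Explicit router nodes increase the SimPy node count (6×6 = up to 32 routers/cube)
+- Existing xbar/bridge/single-NOC based tests need a full rewrite
+- The internal contention model of TwoDMeshNocComponent must be replaced with a per-router model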

 ---