From a0cccc71e806cdf9f2b5eb2d020dc17f88d57cd9 Mon Sep 17 00:00:00 2001
From: Yangwook Kang <ywkang80@gmail.com>
Date: Thu, 14 May 2026 23:23:52 -0700
Subject: [PATCH] Add HW architecture overview (Korean)

Standalone summary of the modeled hardware hierarchy and components.
Cross-references ADR-0003, 0004, 0014, 0017, 0022.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/hw-architecture-overview.md | 237 +++++++++++++++++++++++++++++++
 1 file changed, 237 insertions(+)
 create mode 100644 docs/hw-architecture-overview.md

diff --git a/docs/hw-architecture-overview.md b/docs/hw-architecture-overview.md
new file mode 100644
index 0000000..ec5c5e1
--- /dev/null
+++ b/docs/hw-architecture-overview.md
@@ -0,0 +1,237 @@
+# Hardware Architecture Overview
+
+본 문서는 AI Accelerator 플랫폼의 하드웨어 아키텍처를 요약한다.
+논문 분석 및 설계 검토 시 배경 지식으로 사용할 수 있다.
+
+> Source ADRs: ADR-0003, ADR-0004, ADR-0014, ADR-0017, ADR-0022
+
+---
+
+## 1. System Hierarchy
+
+시스템은 4단계 계층으로 구성된다.
+
+```
+Tray
+ ├── Host CPU (runtime, data placement)
+ ├── SIP 0 (accelerator)
+ │    ├── IO Chiplet (PCIe-EP, IO_CPU)
+ │    ├── CUBE 0
+ │    │    ├── PE 0 ─ PE 7
+ │    │    ├── HBM + HBM_CTRL
+ │    │    ├── Shared SRAM
+ │    │    ├── M_CPU (management)
+ │    │    ├── NOC 2D Mesh (router grid)
+ │    │    └── UCIe × 4 (N/S/E/W)
+ │    ├── CUBE 1 ... CUBE N
+ │    └── IO Chiplet(s)
+ ├── SIP 1 ... SIP M
+ └── Interconnect (PCIe / UAL)
+```
+
+| Level | 구성 | 연결 |
+|-------|------|------|
+| **Tray** | Host CPU + 여러 SIP | PCIe / UAL fabric |
+| **SIP** | 여러 CUBE + IO chiplet(s) | UCIe (cube간), PCIe-EP (host) |
+| **CUBE** | 여러 PE + HBM + SRAM + M_CPU + NOC mesh | UCIe × 4 ports (N/S/E/W) |
+| **PE** | PE_CPU + DMA + GEMM + MATH + TCM | NOC router 직결 |
+
+---
+
+## 2. CUBE Architecture
+
+각 CUBE는 독립적인 compute + memory unit이다.
+
+### 2.1 Components
+
+- **PEs**: 복수의 Processing Element, 각각 독립 커널 실행 가능
+- **HBM + HBM_CTRL**: High Bandwidth Memory. 각 PE에 local HBM 영역이 할당되어 최소 latency로 접근
+- **Shared SRAM**: Cube 내 모든 PE가 NOC를 통해 접근 가능한 공유 메모리
+- **M_CPU**: Management CPU. 커널 command 분배 및 completion 집계
+- **NOC (On-die Fabric)**: Cube 내 모든 컴포넌트를 연결하는 interconnect
+- **UCIe × 4**: 각 방향(N/S/E/W)에 복수 connection, inter-cube 연결
+
+### 2.2 NOC (On-die Fabric)
+
+NOC는 cube 내 PE, HBM, SRAM, M_CPU, UCIe를 연결하는 on-die interconnect이다.
+
+**아키텍처 요구사항** (topology 무관):
+- 모든 PE가 local HBM에 full bandwidth로 접근 가능
+- 모든 PE가 shared SRAM에 접근 가능
+- 모든 PE가 UCIe를 통해 다른 cube에 접근 가능
+- M_CPU가 모든 PE에 command를 전달 가능
+- Per-link contention 모델링 지원
+
+**현재 시뮬레이터 구현** (변경 가능):
+- 2D mesh router grid (6×6 기본, XY deterministic routing)
+- HBM_CTRL가 각 PE의 local router에 직결 (0 mesh hop)
+- 중앙 HBM zone에는 router 배치 제외
+- Contention: directed segment당 capacity=1 resource
+
+NOC topology는 2D mesh 외에 ring, crossbar, hierarchical 등 다른 구현도 가능하며,
+아키텍처 요구사항을 만족하는 한 교체 가능하다.
+
+### 2.3 주요 Data Path
+
+| Path | Route | 특성 |
+|------|-------|------|
+| PE → Local HBM | PE_DMA → NOC → HBM_CTRL | 최소 hop, 256 GB/s (×0.8 eff) |
+| PE → Remote PE's HBM | PE_DMA → NOC hops → HBM_CTRL | NOC BW/hop에 제한 |
+| PE → Shared SRAM | PE_DMA → NOC → SRAM | SRAM link BW에 제한 |
+| PE → Other CUBE's HBM | PE_DMA → NOC → UCIe → NOC → HBM_CTRL | UCIe overhead 16ns (TX+RX) |
+| Kernel Launch | IO → UCIe → M_CPU → NOC → PE_CPU | Command path |
+
+### 2.4 Key Bandwidths
+
+| Connection | Bandwidth | Notes |
+|------------|-----------|-------|
+| PE_DMA ↔ NOC | 256 GB/s | HBM slice BW 매칭 |
+| NOC ↔ HBM_CTRL | 256 GB/s | Per PE, local 접근 |
+| NOC ↔ SRAM | 128 GB/s × 4 | 512 GB/s aggregate |
+| NOC ↔ UCIe conn | 128 GB/s × 4 | 512 GB/s per port |
+| UCIe link (inter-cube) | 512 GB/s | 1.0mm seam distance |
+
+---
+
+## 3. PE Architecture
+
+각 PE는 하나의 커널 인스턴스를 실행하는 독립적인 프로세서이다.
+
+### 3.1 Internal Components
+
+```
+PE_CPU (control)
+  │
+  ├──→ PE_SCHED (dispatch)
+  │       │
+  │       ├──→ PE_DMA ←→ NOC Router ←→ HBM / SRAM / UCIe
+  │       │      ↕
+  │       ├──→ PE_FETCH_STORE ←→ PE_TCM (16MB SRAM)
+  │       │
+  │       ├──→ PE_GEMM (matrix multiply)
+  │       └──→ PE_MATH (elementwise)
+  │
+  └──→ PE_IPCQ (collective communication)
+           │
+           └──→ PE_DMA (IPCQ port)
+```
+
+| Component | 역할 |
+|-----------|------|
+| **PE_CPU** | 커널 instruction stream 실행, command 생성 |
+| **PE_SCHED** | Command dispatcher. Composite command를 tile pipeline으로 분해 |
+| **PE_DMA** | HBM ↔ TCM 데이터 전송 (NOC router mesh 경유). Read/Write 각 1 channel |
+| **PE_GEMM** | 행렬 곱 엔진. TCM에서 activation 읽기, HBM에서 weight streaming 가능 |
+| **PE_MATH** | Element-wise 연산 엔진. TCM 읽기/쓰기 |
+| **PE_TCM** | 16MB on-PE SRAM. Compute의 staging memory |
+| **PE_IPCQ** | PE간 collective communication 제어 (ring buffer pointer 관리) |
+
+### 3.2 Compute Pipeline (Tiled Execution)
+
+Composite command는 tile 단위로 pipeline 실행된다:
+
+```
+DMA_READ(t) → COMPUTE(t) → DMA_WRITE(t)
+```
+
+**Overlap 규칙**:
+- 허용: `DMA_READ(t+1) ∥ COMPUTE(t)`, `DMA_WRITE(t-1) ∥ COMPUTE(t)`
+- 금지: `GEMM(t) ∥ GEMM(t')`, `GEMM(t) ∥ MATH(t')`
+
+**DMA Engine**: Read/Write 각각 capacity=1. 동시 Read+Write 가능, 동시 Read+Read 불가.
+
+**Compute Engine**: GEMM과 MATH가 단일 compute slot 공유. 한 번에 하나만 실행.
+
+### 3.3 TCM-centric Dataflow
+
+모든 compute는 TCM을 중심으로 동작한다:
+
+```
+Input:   HBM → (NOC) → PE_DMA → PE_TCM
+Compute: PE_TCM → GEMM / MATH → PE_TCM
+Output:  PE_TCM → PE_DMA → (NOC) → HBM
+```
+
+PE_TCM은 두 영역으로 분할된다:
+- **SchedulerReservedTCM**: PE_SCHED 전용 tile buffer 영역 (DMA/compute staging)
+- **AllocatableTCM**: 범용 할당 영역 (host/DP-visible)
+
+두 영역은 hard isolation으로 분리된다.
+
+---
+
+## 4. Memory Hierarchy
+
+### 4.1 Memory Tiers
+
+| Memory | Scope | Capacity | Bandwidth | Latency | 접근 경로 |
+|--------|-------|----------|-----------|---------|-----------|
+| **PE_TCM** | PE 전용 | 16 MB | 512 GB/s | 최저 | 직결 (NOC 미경유) |
+| **Shared SRAM** | Cube 공유 | 32 MB | 128 GB/s (NoC link) | 중간 | PE → NOC → SRAM |
+| **Local HBM** | PE별 할당 | Large | 256 GB/s (×0.8 eff) | 높음 | PE → local router → HBM_CTRL |
+| **Remote HBM** | 다른 PE/Cube | Large | Mesh/UCIe BW 제한 | 최고 | PE → NOC mesh → (UCIe) → HBM_CTRL |
+
+### 4.2 Local HBM Bandwidth Guarantee
+
+- 각 PE는 자신의 local router에 직결된 HBM pseudo-channel을 가진다
+- Local HBM 접근은 **0 mesh hop** (switching overhead만)
+- Effective bandwidth = spec BW × efficiency factor (default 0.8)
+- 예: 256 GB/s × 0.8 = 204.8 GB/s effective
+- 이 보장은 fabric bandwidth와 무관하게 유지된다
+
+### 4.3 Memory-Centric Design Principle
+
+- **Compute는 data 근처에서 실행**: PE가 local HBM에 직결되어 데이터 이동 최소화
+- **TCM은 compute의 scratchpad**: 모든 compute 입출력은 TCM을 경유
+- **HBM은 primary storage**: 대용량 tensor 저장, DMA로 TCM에 tile 단위 load/store
+- **Shared SRAM은 cube-level 공유**: 중간 결과 공유, reduction buffer 등
+
+---
+
+## 5. SPMD Execution Model
+
+### 5.1 Program ID Mapping
+
+커널은 2D hardware grid에서 SPMD 방식으로 실행된다:
+
+| API | 반환 값 | 설명 |
+|-----|---------|------|
+| `tl.program_id(axis=0)` | `local_pe_id` | Cube 내 PE 인덱스 |
+| `tl.program_id(axis=1)` | `cube_id` | Cube 인덱스 |
+| `tl.num_programs(axis=0)` | `num_pes_per_cube` | Cube당 PE 수 |
+| `tl.num_programs(axis=1)` | `num_cubes` | 전체 Cube 수 |
+
+```python
+global_pid = tl.program_id(axis=1) * tl.num_programs(axis=0) + tl.program_id(axis=0)
+```
+
+### 5.2 Axis Mapping Rationale
+
+- **axis=0 = PE (innermost)**: Cube 내 PE는 HBM을 공유하고 local NOC로 통신. 빠르고 tightly-coupled. GPU의 thread-in-block에 대응.
+- **axis=1 = Cube (outer)**: Cube 간 통신은 UCIe 경유로 latency 높음. Coarse scheduling 단위. GPU의 block-in-grid에 대응.
+
+### 5.3 Kernel Execution Flow
+
+```
+Host CPU
+  → IO_CPU (PCIe-EP)
+    → M_CPU (management, per cube)
+      → PE_CPU × N (broadcast)
+        → Each PE executes same kernel with unique (pe_id, cube_id)
+```
+
+모든 PE가 동일 커널을 실행하되, `program_id`로 자신의 데이터 파티션을 식별하여
+독립적으로 처리한다 (SPMD).
+
+---
+
+## 6. Inter-PE Communication (IPCQ)
+
+PE 간 collective communication은 IPCQ(Inter-PE Communication Queue)를 통해 수행된다.
+
+- 각 PE는 방향별(N/S/E/W 등) ring buffer 기반 queue pair를 유지
+- **DMA-IPCQ co-design**: DMA data flit에 head pointer를 piggyback하여 별도 제어 메시지 없이 pointer 동기화
+- **Credit-based flow control**: Receiver가 slot 소비 후 16B credit으로 sender에게 알림
+- IPCQ slot buffer는 **TCM, Shared SRAM, Local HBM** 중 선택 가능
+
+자세한 내용은 `docs/ipcq-dma-codesign-hw.md` 및 ADR-0023 참조.