Compare commits
4 Commits
9beb140eaa
...
a44f832be5
| Author | SHA1 | Date | |
|---|---|---|---|
| a44f832be5 | |||
| a0cccc71e8 | |||
| 32b29a1e5c | |||
| c9bd5387ac |
@@ -35,11 +35,13 @@ We model the system hierarchy explicitly:
|
|||||||
|
|
||||||
- A CUBE contains:
|
- A CUBE contains:
|
||||||
- HBM + memory controller (HBM_CTRL)
|
- HBM + memory controller (HBM_CTRL)
|
||||||
- NOC router mesh: 2D grid of explicit routers (from cube_mesh.yaml) with XY routing;
|
- NOC (on-die fabric): carries all intra-cube traffic including HBM data,
|
||||||
carries all intra-cube traffic including HBM data, inter-cube (UCIe),
|
inter-cube (UCIe), command (M_CPU↔PE_CPU), and shared SRAM access.
|
||||||
command (M_CPU↔PE_CPU), and shared SRAM access.
|
Must provide: full-BW PE↔local HBM path, PE↔SRAM connectivity,
|
||||||
HBM_CTRL is attached to PE routers (local HBM = 0 hop).
|
PE↔UCIe connectivity, M_CPU↔PE command path.
|
||||||
See ADR-0017 and ADR-0019 for full architecture.
|
NOC topology is an implementation choice (e.g., 2D mesh, ring, crossbar);
|
||||||
|
current implementation uses a 2D mesh with XY routing (see ADR-0017).
|
||||||
|
HBM_CTRL is attached to each PE's local NOC port (local HBM = minimal hop).
|
||||||
- Shared SRAM: cube-level shared memory accessible by all PEs via NOC
|
- Shared SRAM: cube-level shared memory accessible by all PEs via NOC
|
||||||
- management/control CPU (M_CPU) coordinating PE command distribution and completion aggregation
|
- management/control CPU (M_CPU) coordinating PE command distribution and completion aggregation
|
||||||
- multiple PEs
|
- multiple PEs
|
||||||
|
|||||||
@@ -44,15 +44,15 @@ Each PE contains the following logical components.
|
|||||||
**PE_DMA**
|
**PE_DMA**
|
||||||
|
|
||||||
- Handles memory transfers between PE_TCM and external memory domains.
|
- Handles memory transfers between PE_TCM and external memory domains.
|
||||||
- PE_DMA connects to the NOC router mesh at the CUBE level (ADR-0019):
|
- PE_DMA connects to the cube-level NOC (on-die fabric):
|
||||||
- All destinations (HBM, shared SRAM, inter-cube UCIe) are reached via the router mesh
|
- All destinations (HBM, shared SRAM, inter-cube UCIe) are reached via the NOC
|
||||||
- Local HBM access: PE_DMA → local router → hbm_ctrl (switching overhead only)
|
- Local HBM access: PE_DMA → NOC → hbm_ctrl (minimal hop)
|
||||||
- Remote/shared: PE_DMA → local router → (mesh hops) → destination
|
- Remote/shared: PE_DMA → NOC → (fabric hops) → destination
|
||||||
- Supported directions include:
|
- Supported directions include:
|
||||||
- HBM → PE_TCM (via router mesh)
|
- HBM → PE_TCM (via NOC)
|
||||||
- PE_TCM → HBM (via router mesh)
|
- PE_TCM → HBM (via NOC)
|
||||||
- PE_TCM → shared SRAM (via router mesh)
|
- PE_TCM → shared SRAM (via NOC)
|
||||||
- PE_TCM → other memory domains (via router mesh, if supported by topology)
|
- PE_TCM → other memory domains (via NOC, if supported by topology)
|
||||||
|
|
||||||
**PE_GEMM**
|
**PE_GEMM**
|
||||||
|
|
||||||
@@ -252,7 +252,7 @@ Compute operations use a TCM-centric dataflow model.
|
|||||||
**Input path (HBM)**
|
**Input path (HBM)**
|
||||||
|
|
||||||
```text
|
```text
|
||||||
HBM → router mesh → PE_DMA (DMA_READ) → PE_TCM
|
HBM → NOC → PE_DMA (DMA_READ) → PE_TCM
|
||||||
```
|
```
|
||||||
|
|
||||||
**Input path (shared SRAM)**
|
**Input path (shared SRAM)**
|
||||||
@@ -269,14 +269,14 @@ Compute engines read input tensors from PE_TCM.
|
|||||||
PE_TCM → GEMM / MATH
|
PE_TCM → GEMM / MATH
|
||||||
```
|
```
|
||||||
|
|
||||||
Weights for GEMM may optionally stream directly from HBM (via router mesh).
|
Weights for GEMM may optionally stream directly from HBM (via NOC).
|
||||||
|
|
||||||
**Output path (HBM)**
|
**Output path (HBM)**
|
||||||
|
|
||||||
Compute results are written to PE_TCM, then DMA writes to HBM.
|
Compute results are written to PE_TCM, then DMA writes to HBM.
|
||||||
|
|
||||||
```text
|
```text
|
||||||
PE_TCM → PE_DMA (DMA_WRITE) → router mesh → HBM
|
PE_TCM → PE_DMA (DMA_WRITE) → NOC → HBM
|
||||||
```
|
```
|
||||||
|
|
||||||
**Output path (shared SRAM)**
|
**Output path (shared SRAM)**
|
||||||
@@ -348,9 +348,9 @@ PE instances are derived from `cube.pe_layout`.
|
|||||||
|
|
||||||
External connectivity such as:
|
External connectivity such as:
|
||||||
|
|
||||||
- PE_DMA → router mesh → HBM (data path, ADR-0019)
|
- PE_DMA → NOC → HBM (data path)
|
||||||
- PE_DMA → router mesh → shared SRAM, inter-cube UCIe (non-HBM data path)
|
- PE_DMA → NOC → shared SRAM, inter-cube UCIe (non-HBM data path)
|
||||||
- router mesh → PE_CPU (command path from M_CPU)
|
- NOC → PE_CPU (command path from M_CPU)
|
||||||
|
|
||||||
is modeled at the CUBE level (see ADR-0003 D3).
|
is modeled at the CUBE level (see ADR-0003 D3).
|
||||||
|
|
||||||
|
|||||||
@@ -106,36 +106,47 @@ Note: multi-stream merging at routers IS modeled correctly — each
|
|||||||
in_port has its own fan_in process, all push to a shared inbox, and
|
in_port has its own fan_in process, all push to a shared inbox, and
|
||||||
the router worker forwards in inbox FIFO order. Flits from different
|
the router worker forwards in inbox FIFO order. Flits from different
|
||||||
upstream streams naturally interleave at flit granularity. The items
|
upstream streams naturally interleave at flit granularity. The items
|
||||||
below are different concerns.
|
below are different concerns, ordered by expected workload impact.
|
||||||
|
|
||||||
|
**Higher impact (workload accuracy gap)**:
|
||||||
|
|
||||||
- [ ] **Cycle-accurate router arbitration policies** (RR with
|
|
||||||
priorities, age, iSLIP). Currently the inbox FIFO order is used as
|
|
||||||
a proxy for fair RR — works when flit arrival times differ slightly
|
|
||||||
between streams, but doesn't reflect intentional priority/QoS.
|
|
||||||
- [ ] **Sub-flit (32B) granularity** for finer wire arbitration
|
|
||||||
cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
|
|
||||||
per 32B flit. Effect is small for most workloads (sub-flit timing
|
|
||||||
noise).
|
|
||||||
- [ ] **Address-based PC selection at HBM CTRL** (replace the
|
- [ ] **Address-based PC selection at HBM CTRL** (replace the
|
||||||
address-blind global round-robin). When two transactions of size
|
address-blind global round-robin). When two transactions of size
|
||||||
`num_pcs × burst_bytes` (e.g., 2KB at 8 PCs × 256B) arrive
|
`num_pcs × burst_bytes` (e.g., 2KB at 8 PCs × 256B) arrive
|
||||||
concurrently, both claim PCs 0..7 via global RR, producing full
|
concurrently, both claim PCs 0..7 via global RR, producing full
|
||||||
per-PC contention. Real HW uses address bits to select PCs, so
|
per-PC contention even when real-HW address striping would put
|
||||||
different-address transactions hit different PC patterns. Address
|
them on disjoint PC sets. Directly affects multi-PE concurrent
|
||||||
modeling would let the simulator reflect cache-line/page-aware
|
HBM workload latencies.
|
||||||
layouts.
|
|
||||||
- [ ] **Bank-level conflict modeling** within a PC (opt-in via
|
- [ ] **Bank-level conflict modeling** within a PC (opt-in via
|
||||||
`track_banks: true`). Currently we assume no same-bank reuse.
|
`track_banks: true`). Currently we assume no same-bank reuse;
|
||||||
|
random scatter/gather workloads are optimistic here.
|
||||||
- [ ] **HBM scheduler** with write buffer + watermark drain (Tier 2
|
- [ ] **HBM scheduler** with write buffer + watermark drain (Tier 2
|
||||||
from the design discussion). Default `switch_penalty_ns=0` is the
|
from the design discussion). Default `switch_penalty_ns=0` is the
|
||||||
ideal-amortization stand-in.
|
ideal-amortization stand-in; bursty mixed R/W workloads benefit
|
||||||
- [ ] **Backpressure** modeling for finite component buffers.
|
from explicit modeling.
|
||||||
|
- [ ] **Backpressure** modeling for finite component buffers. Matters
|
||||||
|
at high concurrency / sustained saturation where buffer occupancy
|
||||||
|
causes upstream stalls.
|
||||||
- [ ] **Op_log integration with chunk-streaming**: currently op_log
|
- [ ] **Op_log integration with chunk-streaming**: currently op_log
|
||||||
fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd,
|
fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd,
|
||||||
GemmCmd, MathCmd) which are not chunkified. Integration would
|
GemmCmd, MathCmd) which are not chunkified. Integration would
|
||||||
require flit-aware components to also emit op_log start/end hooks
|
require flit-aware components to also emit op_log start/end hooks
|
||||||
per transaction (start on first flit, end on is_last).
|
per transaction (start on first flit, end on is_last).
|
||||||
|
|
||||||
|
**Lower impact (academic / specific use cases)**:
|
||||||
|
|
||||||
|
- [ ] **Cycle-accurate router arbitration policies** (RR with
|
||||||
|
priorities, age, iSLIP). The FIFO inbox is already approximately
|
||||||
|
fair when flit arrival times differ slightly between streams (the
|
||||||
|
common case for similar-rate workloads). True impact appears only
|
||||||
|
for: (a) priority/QoS modeling, (b) per-stream tail latency
|
||||||
|
analysis under sustained saturation. Not critical for makespan or
|
||||||
|
average-latency studies.
|
||||||
|
- [ ] **Sub-flit (32B) granularity** for finer wire arbitration
|
||||||
|
cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
|
||||||
|
per 32B flit. Effect is small for most workloads (sub-flit timing
|
||||||
|
noise on small messages).
|
||||||
|
|
||||||
## Consequences
|
## Consequences
|
||||||
|
|
||||||
- Single review point for all model fidelity questions. Each future PR
|
- Single review point for all model fidelity questions. Each future PR
|
||||||
|
|||||||
@@ -1,13 +1,13 @@
|
|||||||
buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
|
buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
|
||||||
hbm,torus_2d,6,128,256,1858.0399999999827
|
hbm,torus_2d,6,128,256,2144.0399999999754
|
||||||
hbm,torus_2d,6,1024,2048,2389.0399999999827
|
hbm,torus_2d,6,1024,2048,2908.74499999995
|
||||||
hbm,torus_2d,6,8192,16384,6673.039999999986
|
hbm,torus_2d,6,8192,16384,8851.185000000081
|
||||||
hbm,torus_2d,6,32768,65536,21361.03999999992
|
hbm,torus_2d,6,32768,65536,29225.265000008752
|
||||||
sram,torus_2d,6,128,256,1774.0399999999827
|
sram,torus_2d,6,128,256,2060.0399999999754
|
||||||
sram,torus_2d,6,1024,2048,2389.0399999999827
|
sram,torus_2d,6,1024,2048,2908.74499999995
|
||||||
sram,torus_2d,6,8192,16384,7345.039999999986
|
sram,torus_2d,6,8192,16384,9523.185000000081
|
||||||
sram,torus_2d,6,32768,65536,24337.039999999935
|
sram,torus_2d,6,32768,65536,32201.265000008752
|
||||||
tcm,torus_2d,6,128,256,1678.0399999999827
|
tcm,torus_2d,6,128,256,1964.0399999999754
|
||||||
tcm,torus_2d,6,1024,2048,1957.0399999999827
|
tcm,torus_2d,6,1024,2048,2476.74499999995
|
||||||
tcm,torus_2d,6,8192,16384,4225.039999999986
|
tcm,torus_2d,6,8192,16384,6403.185000000081
|
||||||
tcm,torus_2d,6,32768,65536,12001.03999999992
|
tcm,torus_2d,6,32768,65536,19865.265000008738
|
||||||
|
|||||||
|
|
Before Width: | Height: | Size: 74 KiB After Width: | Height: | Size: 76 KiB |
|
Before Width: | Height: | Size: 40 KiB After Width: | Height: | Size: 39 KiB |
|
Before Width: | Height: | Size: 82 KiB After Width: | Height: | Size: 79 KiB |
|
Before Width: | Height: | Size: 38 KiB After Width: | Height: | Size: 37 KiB |
@@ -1,37 +1,37 @@
|
|||||||
algorithm,sip_topology,n_sips,n_elem,bytes_per_pe,bytes_per_sip,latency_ns
|
algorithm,sip_topology,n_sips,n_elem,bytes_per_pe,bytes_per_sip,latency_ns
|
||||||
intercube_allreduce,mesh_2d_no_wrap,6,8,16,256,2626.302499999998
|
intercube_allreduce,mesh_2d_no_wrap,6,8,16,256,2666.5524999999725
|
||||||
intercube_allreduce,mesh_2d_no_wrap,6,32,64,1024,2634.7399999999952
|
intercube_allreduce,mesh_2d_no_wrap,6,32,64,1024,2747.7399999999725
|
||||||
intercube_allreduce,mesh_2d_no_wrap,6,64,128,2048,2645.9899999999925
|
intercube_allreduce,mesh_2d_no_wrap,6,64,128,2048,2855.98999999998
|
||||||
intercube_allreduce,mesh_2d_no_wrap,6,128,256,4096,2668.489999999987
|
intercube_allreduce,mesh_2d_no_wrap,6,128,256,4096,3072.4899999999725
|
||||||
intercube_allreduce,mesh_2d_no_wrap,6,512,1024,16384,2812.489999999987
|
intercube_allreduce,mesh_2d_no_wrap,6,512,1024,16384,3336.579999999951
|
||||||
intercube_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3010.489999999987
|
intercube_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3707.49999999992
|
||||||
intercube_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,3406.489999999987
|
intercube_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,4449.339999999875
|
||||||
intercube_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,4198.489999999965
|
intercube_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,5933.020000000055
|
||||||
intercube_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,5782.489999999969
|
intercube_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,8900.380000000157
|
||||||
intercube_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,8950.489999999925
|
intercube_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,14835.099999997583
|
||||||
intercube_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,15286.48999999986
|
intercube_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,26704.540000017492
|
||||||
intercube_allreduce,mesh_2d_no_wrap,6,49152,98304,1572864,21622.489999999932
|
intercube_allreduce,mesh_2d_no_wrap,6,49152,98304,1572864,38573.980000026335
|
||||||
intercube_allreduce,ring_1d,6,8,16,256,2302.9849999999933
|
intercube_allreduce,ring_1d,6,8,16,256,2365.2558333333036
|
||||||
intercube_allreduce,ring_1d,6,32,64,1024,2310.8599999999906
|
intercube_allreduce,ring_1d,6,32,64,1024,2436.9433333333036
|
||||||
intercube_allreduce,ring_1d,6,64,128,2048,2321.359999999988
|
intercube_allreduce,ring_1d,6,64,128,2048,2532.526666666643
|
||||||
intercube_allreduce,ring_1d,6,128,256,4096,2342.3599999999824
|
intercube_allreduce,ring_1d,6,128,256,4096,2723.6933333333036
|
||||||
intercube_allreduce,ring_1d,6,512,1024,16384,2479.3599999999824
|
intercube_allreduce,ring_1d,6,512,1024,16384,3042.0349999999544
|
||||||
intercube_allreduce,ring_1d,6,1024,2048,32768,2669.3599999999824
|
intercube_allreduce,ring_1d,6,1024,2048,32768,3390.201666666597
|
||||||
intercube_allreduce,ring_1d,6,2048,4096,65536,3049.3599999999824
|
intercube_allreduce,ring_1d,6,2048,4096,65536,4079.7349999998714
|
||||||
intercube_allreduce,ring_1d,6,4096,8192,131072,3809.3599999999715
|
intercube_allreduce,ring_1d,6,4096,8192,131072,5458.801666666721
|
||||||
intercube_allreduce,ring_1d,6,8192,16384,262144,5329.359999999979
|
intercube_allreduce,ring_1d,6,8192,16384,262144,8216.93500000014
|
||||||
intercube_allreduce,ring_1d,6,16384,32768,524288,8369.35999999992
|
intercube_allreduce,ring_1d,6,16384,32768,524288,13733.201666664638
|
||||||
intercube_allreduce,ring_1d,6,32768,65536,1048576,14449.359999999899
|
intercube_allreduce,ring_1d,6,32768,65536,1048576,24765.735000014545
|
||||||
intercube_allreduce,ring_1d,6,49152,98304,1572864,20529.35999999997
|
intercube_allreduce,ring_1d,6,49152,98304,1572864,35798.268333355256
|
||||||
intercube_allreduce,torus_2d,6,8,16,256,1644.2899999999936
|
intercube_allreduce,torus_2d,6,8,16,256,1700.6024999999754
|
||||||
intercube_allreduce,torus_2d,6,32,64,1024,1651.0399999999909
|
intercube_allreduce,torus_2d,6,32,64,1024,1753.2899999999754
|
||||||
intercube_allreduce,torus_2d,6,64,128,2048,1660.0399999999881
|
intercube_allreduce,torus_2d,6,64,128,2048,1823.539999999979
|
||||||
intercube_allreduce,torus_2d,6,128,256,4096,1678.0399999999827
|
intercube_allreduce,torus_2d,6,128,256,4096,1964.0399999999754
|
||||||
intercube_allreduce,torus_2d,6,512,1024,16384,1795.0399999999827
|
intercube_allreduce,torus_2d,6,512,1024,16384,2196.2849999999653
|
||||||
intercube_allreduce,torus_2d,6,1024,2048,32768,1957.0399999999827
|
intercube_allreduce,torus_2d,6,1024,2048,32768,2476.74499999995
|
||||||
intercube_allreduce,torus_2d,6,2048,4096,65536,2281.0399999999827
|
intercube_allreduce,torus_2d,6,2048,4096,65536,3037.664999999919
|
||||||
intercube_allreduce,torus_2d,6,4096,8192,131072,2929.039999999979
|
intercube_allreduce,torus_2d,6,4096,8192,131072,4159.50500000003
|
||||||
intercube_allreduce,torus_2d,6,8192,16384,262144,4225.039999999986
|
intercube_allreduce,torus_2d,6,8192,16384,262144,6403.185000000081
|
||||||
intercube_allreduce,torus_2d,6,16384,32768,524288,6817.039999999943
|
intercube_allreduce,torus_2d,6,16384,32768,524288,10890.544999998769
|
||||||
intercube_allreduce,torus_2d,6,32768,65536,1048576,12001.03999999992
|
intercube_allreduce,torus_2d,6,32768,65536,1048576,19865.265000008738
|
||||||
intercube_allreduce,torus_2d,6,49152,98304,1572864,17185.039999999994
|
intercube_allreduce,torus_2d,6,49152,98304,1572864,28839.985000013185
|
||||||
|
|||||||
|
|
Before Width: | Height: | Size: 38 KiB After Width: | Height: | Size: 36 KiB |
|
Before Width: | Height: | Size: 233 KiB After Width: | Height: | Size: 233 KiB |
|
Before Width: | Height: | Size: 166 KiB After Width: | Height: | Size: 165 KiB |
|
Before Width: | Height: | Size: 50 KiB After Width: | Height: | Size: 47 KiB |
|
Before Width: | Height: | Size: 49 KiB After Width: | Height: | Size: 47 KiB |
|
Before Width: | Height: | Size: 54 KiB After Width: | Height: | Size: 51 KiB |
|
Before Width: | Height: | Size: 53 KiB After Width: | Height: | Size: 43 KiB |
|
Before Width: | Height: | Size: 109 KiB After Width: | Height: | Size: 111 KiB |
@@ -1,81 +1,81 @@
|
|||||||
hop,label,size_bytes,path,total_ns
|
hop,label,size_bytes,path,total_ns
|
||||||
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,ipcq,31.3899999999976
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,ipcq,42.8899999999976
|
||||||
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,raw,12.019999999996799
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,raw,29.0199999999968
|
||||||
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,ipcq,33.1399999999976
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,ipcq,48.1399999999976
|
||||||
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,raw,13.019999999996799
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,raw,31.0199999999968
|
||||||
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,ipcq,34.8899999999976
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,ipcq,50.3899999999976
|
||||||
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,raw,14.019999999996799
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,raw,32.0199999999968
|
||||||
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,ipcq,36.6399999999976
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,ipcq,52.6399999999976
|
||||||
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,raw,15.019999999996799
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,raw,33.0199999999968
|
||||||
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,ipcq,40.1399999999976
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,ipcq,57.1399999999976
|
||||||
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,raw,17.0199999999968
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,raw,35.0199999999968
|
||||||
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,ipcq,43.6399999999976
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,ipcq,62.6399999999976
|
||||||
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,raw,19.0199999999968
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,raw,37.0199999999968
|
||||||
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,ipcq,57.6399999999976
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,ipcq,84.6399999999976
|
||||||
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,raw,27.0199999999968
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,raw,45.0199999999968
|
||||||
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,ipcq,85.6399999999976
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,ipcq,128.6399999999976
|
||||||
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,raw,43.0199999999968
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,raw,61.0199999999968
|
||||||
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,ipcq,141.64000000000306
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,ipcq,216.64000000000306
|
||||||
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,raw,75.02000000000407
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,raw,93.02000000000407
|
||||||
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,ipcq,169.64000000000306
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,ipcq,260.64000000000306
|
||||||
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,raw,91.02000000000407
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,raw,109.02000000000407
|
||||||
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,ipcq,31.3899999999976
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,ipcq,42.8899999999976
|
||||||
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,raw,12.019999999996799
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,raw,29.0199999999968
|
||||||
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,ipcq,33.1399999999976
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,ipcq,48.1399999999976
|
||||||
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,raw,13.019999999996799
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,raw,31.0199999999968
|
||||||
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,ipcq,34.8899999999976
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,ipcq,50.3899999999976
|
||||||
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,raw,14.019999999996799
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,raw,32.0199999999968
|
||||||
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,ipcq,36.6399999999976
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,ipcq,52.6399999999976
|
||||||
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,raw,15.019999999996799
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,raw,33.0199999999968
|
||||||
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,ipcq,40.1399999999976
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,ipcq,57.1399999999976
|
||||||
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,raw,17.0199999999968
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,raw,35.0199999999968
|
||||||
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,ipcq,43.6399999999976
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,ipcq,62.6399999999976
|
||||||
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,raw,19.0199999999968
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,raw,37.0199999999968
|
||||||
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,ipcq,57.6399999999976
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,ipcq,84.6399999999976
|
||||||
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,raw,27.0199999999968
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,raw,45.0199999999968
|
||||||
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,ipcq,85.6399999999976
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,ipcq,128.6399999999976
|
||||||
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,raw,43.0199999999968
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,raw,61.0199999999968
|
||||||
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,ipcq,141.64000000000306
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,ipcq,216.64000000000306
|
||||||
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,raw,75.02000000000407
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,raw,93.02000000000407
|
||||||
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,ipcq,169.64000000000306
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,ipcq,260.64000000000306
|
||||||
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,raw,91.02000000000407
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,raw,109.02000000000407
|
||||||
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,ipcq,67.40999999999804
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,ipcq,81.15999999999804
|
||||||
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,raw,68.53999999999724
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,raw,89.28999999999724
|
||||||
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,ipcq,69.15999999999804
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,ipcq,88.65999999999804
|
||||||
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,raw,70.03999999999724
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,raw,95.53999999999724
|
||||||
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,ipcq,70.90999999999804
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,ipcq,90.90999999999804
|
||||||
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,raw,71.53999999999724
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,raw,96.53999999999724
|
||||||
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,ipcq,72.65999999999804
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,ipcq,93.15999999999804
|
||||||
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,raw,73.03999999999724
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,raw,97.53999999999724
|
||||||
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,ipcq,76.15999999999804
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,ipcq,97.65999999999804
|
||||||
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,raw,76.03999999999724
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,raw,99.53999999999724
|
||||||
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,ipcq,79.65999999999804
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,ipcq,103.15999999999804
|
||||||
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,raw,79.03999999999724
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,raw,102.53999999999724
|
||||||
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,ipcq,93.65999999999804
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,ipcq,125.15999999999804
|
||||||
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,raw,91.03999999999724
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,raw,114.53999999999724
|
||||||
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,ipcq,121.65999999999804
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,ipcq,169.15999999999804
|
||||||
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,raw,115.03999999999724
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,raw,138.53999999999724
|
||||||
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,ipcq,177.65999999999985
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,ipcq,257.15999999999985
|
||||||
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,raw,163.04000000000087
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,raw,186.54000000000087
|
||||||
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,ipcq,205.65999999999985
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,ipcq,301.15999999999985
|
||||||
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,raw,187.04000000000087
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,raw,210.54000000000087
|
||||||
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,ipcq,87.40999999999804
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,ipcq,103.15999999999804
|
||||||
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,raw,88.53999999999724
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,raw,111.28999999999724
|
||||||
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,ipcq,89.15999999999804
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,ipcq,112.65999999999804
|
||||||
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,raw,90.03999999999724
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,raw,119.53999999999724
|
||||||
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,ipcq,90.90999999999804
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,ipcq,114.90999999999804
|
||||||
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,raw,91.53999999999724
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,raw,120.53999999999724
|
||||||
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,ipcq,92.65999999999804
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,ipcq,117.15999999999804
|
||||||
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,raw,93.03999999999724
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,raw,121.53999999999724
|
||||||
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,ipcq,96.15999999999804
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,ipcq,121.65999999999804
|
||||||
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,raw,96.03999999999724
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,raw,123.53999999999724
|
||||||
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,ipcq,99.65999999999804
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,ipcq,127.15999999999804
|
||||||
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,raw,99.03999999999724
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,raw,126.53999999999724
|
||||||
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,ipcq,113.65999999999804
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,ipcq,149.15999999999804
|
||||||
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,raw,111.03999999999724
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,raw,138.53999999999724
|
||||||
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,ipcq,141.65999999999804
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,ipcq,193.15999999999804
|
||||||
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,raw,135.03999999999724
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,raw,162.53999999999724
|
||||||
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,ipcq,197.65999999999985
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,ipcq,281.15999999999985
|
||||||
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,raw,183.04000000000087
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,raw,210.54000000000087
|
||||||
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,ipcq,225.65999999999985
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,ipcq,325.15999999999985
|
||||||
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,raw,207.04000000000087
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,raw,234.54000000000087
|
||||||
|
|||||||
|
|
After Width: | Height: | Size: 368 KiB |
@@ -1,33 +1,101 @@
|
|||||||
<svg xmlns="http://www.w3.org/2000/svg" width="500" height="360" viewBox="0 0 500 360">
|
<svg xmlns="http://www.w3.org/2000/svg" width="560" height="420" viewBox="0 0 560 420">
|
||||||
<title>pe</title>
|
<title>pe</title>
|
||||||
<rect width="500" height="360" fill="#f8fafc"/>
|
<rect width="560" height="420" fill="#f8fafc"/>
|
||||||
<text x="250" y="18" text-anchor="middle" font-family="monospace" font-size="14" font-weight="bold" fill="#1e293b">PE VIEW</text>
|
<text x="280" y="18" text-anchor="middle" font-family="monospace" font-size="14" font-weight="bold" fill="#1e293b">PE VIEW</text>
|
||||||
<line x1="92.5" y1="180.0" x2="180.0" y2="180.0" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
|
|
||||||
<text x="136.2" y="176.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">0.5mm</text>
|
<!-- ── Boxes ── -->
|
||||||
<polyline points="180.0,180.0 180.0,92.5 285.0,92.5" fill="none" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
|
|
||||||
<text x="232.5" y="132.2" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">0.5mm</text>
|
<!-- PE CPU -->
|
||||||
<line x1="180.0" y1="180.0" x2="285.0" y2="180.0" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
|
<rect x="48.8" y="185.5" width="87.5" height="49.0" rx="4" fill="#ef4444" stroke="#475569" stroke-width="1"/>
|
||||||
<text x="232.5" y="176.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">0.5mm</text>
|
<text x="92.5" y="214.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE CPU</text>
|
||||||
<polyline points="180.0,180.0 180.0,267.5 285.0,267.5" fill="none" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
|
|
||||||
<text x="232.5" y="219.8" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">0.5mm</text>
|
<!-- PE SCHEDULER -->
|
||||||
<polyline points="285.0,92.5 390.0,92.5 390.0,180.0" fill="none" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
|
<rect x="156.2" y="185.5" width="87.5" height="49.0" rx="4" fill="#f59e0b" stroke="#475569" stroke-width="1"/>
|
||||||
<text x="337.5" y="132.2" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">0.5mm 512GB/s</text>
|
<text x="200.0" y="214.0" text-anchor="middle" font-family="monospace" font-size="9" fill="#1e293b">PE SCHEDULER</text>
|
||||||
<line x1="285.0" y1="180.0" x2="390.0" y2="180.0" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
|
|
||||||
<text x="337.5" y="176.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">0.5mm 512GB/s</text>
|
<!-- PE_IPCQ (control plane) — new -->
|
||||||
<polyline points="285.0,267.5 390.0,267.5 390.0,180.0" fill="none" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
|
<rect x="48.8" y="68.0" width="105" height="49.0" rx="4" fill="#0ea5e9" stroke="#0277bd" stroke-width="1.5" stroke-dasharray="5,3"/>
|
||||||
<text x="337.5" y="219.8" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">0.5mm 512GB/s</text>
|
<text x="101.3" y="89.0" text-anchor="middle" font-family="monospace" font-size="9" fill="#ffffff">PE IPCQ</text>
|
||||||
<rect x="48.8" y="155.5" width="87.5" height="49.0" rx="4" fill="#ef4444" stroke="#475569" stroke-width="1"/>
|
<text x="101.3" y="102.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#e0f2fe">(control plane)</text>
|
||||||
<text x="92.5" y="184.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE CPU</text>
|
|
||||||
<rect x="136.2" y="155.5" width="87.5" height="49.0" rx="4" fill="#f59e0b" stroke="#475569" stroke-width="1"/>
|
<!-- PE MMU -->
|
||||||
<text x="180.0" y="184.0" text-anchor="middle" font-family="monospace" font-size="9" fill="#1e293b">PE SCHEDULER</text>
|
<rect x="173.8" y="68.0" width="87.5" height="49.0" rx="4" fill="#e2e8f0" stroke="#475569" stroke-width="1"/>
|
||||||
<rect x="241.2" y="68.0" width="87.5" height="49.0" rx="4" fill="#3b82f6" stroke="#475569" stroke-width="1"/>
|
<text x="217.5" y="96.5" text-anchor="middle" font-family="monospace" font-size="10" fill="#1e293b">PE MMU</text>
|
||||||
<text x="285.0" y="96.5" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE DMA</text>
|
|
||||||
<rect x="241.2" y="155.5" width="87.5" height="49.0" rx="4" fill="#8b5cf6" stroke="#475569" stroke-width="1"/>
|
<!-- PE DMA -->
|
||||||
<text x="285.0" y="184.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE GEMM</text>
|
<rect x="281.2" y="68.0" width="87.5" height="49.0" rx="4" fill="#3b82f6" stroke="#475569" stroke-width="1"/>
|
||||||
<rect x="241.2" y="243.0" width="87.5" height="49.0" rx="4" fill="#ec4899" stroke="#475569" stroke-width="1"/>
|
<text x="325.0" y="96.5" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE DMA</text>
|
||||||
<text x="285.0" y="271.5" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE MATH</text>
|
|
||||||
<rect x="136.2" y="68.0" width="87.5" height="49.0" rx="4" fill="#e2e8f0" stroke="#475569" stroke-width="1"/>
|
<!-- PE GEMM -->
|
||||||
<text x="180.0" y="96.5" text-anchor="middle" font-family="monospace" font-size="10" fill="#1e293b">PE MMU</text>
|
<rect x="281.2" y="185.5" width="87.5" height="49.0" rx="4" fill="#8b5cf6" stroke="#475569" stroke-width="1"/>
|
||||||
<rect x="346.2" y="155.5" width="87.5" height="49.0" rx="4" fill="#10b981" stroke="#475569" stroke-width="1"/>
|
<text x="325.0" y="214.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE GEMM</text>
|
||||||
<text x="390.0" y="184.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE TCM</text>
|
|
||||||
</svg>
|
<!-- PE MATH -->
|
||||||
|
<rect x="281.2" y="283.0" width="87.5" height="49.0" rx="4" fill="#ec4899" stroke="#475569" stroke-width="1"/>
|
||||||
|
<text x="325.0" y="311.5" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE MATH</text>
|
||||||
|
|
||||||
|
<!-- PE TCM (with IPCQ Slot Region) -->
|
||||||
|
<rect x="396.2" y="155.5" width="120" height="100" rx="4" fill="#10b981" stroke="#475569" stroke-width="1"/>
|
||||||
|
<text x="456.2" y="180.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE TCM</text>
|
||||||
|
<!-- IPCQ Slot Region inside TCM -->
|
||||||
|
<rect x="406.2" y="193.0" width="100" height="28" rx="3" fill="#065f46" stroke="#ffffff" stroke-width="1" stroke-dasharray="4,2" opacity="0.7"/>
|
||||||
|
<text x="456.2" y="211.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#d1fae5">IPCQ Slot Region</text>
|
||||||
|
|
||||||
|
<!-- ── Connections (edges) ── -->
|
||||||
|
|
||||||
|
<!-- PE CPU → PE SCHEDULER -->
|
||||||
|
<line x1="136.3" y1="210.0" x2="156.2" y2="210.0" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
|
||||||
|
<text x="146.2" y="205.0" text-anchor="middle" font-family="monospace" font-size="6" fill="#64748b">cmd</text>
|
||||||
|
|
||||||
|
<!-- PE CPU → PE_IPCQ (IpcqRequest) -->
|
||||||
|
<line x1="92.5" y1="185.5" x2="92.5" y2="117.0" stroke="#0277bd" stroke-width="1.5"/>
|
||||||
|
<polygon points="92.5,117.0 89.5,123.0 95.5,123.0" fill="#0277bd"/>
|
||||||
|
<text x="77" y="152.0" text-anchor="middle" font-family="monospace" font-size="6" fill="#0277bd" transform="rotate(-90,77,152)">IpcqRequest</text>
|
||||||
|
|
||||||
|
<!-- PE SCHEDULER → PE DMA (TileToken, compute port) -->
|
||||||
|
<polyline points="200.0,185.5 200.0,92.5 281.2,92.5" fill="none" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
|
||||||
|
<text x="240.0" y="86.5" text-anchor="middle" font-family="monospace" font-size="6" fill="#64748b">TileToken</text>
|
||||||
|
|
||||||
|
<!-- PE SCHEDULER → PE GEMM -->
|
||||||
|
<line x1="243.7" y1="210.0" x2="281.2" y2="210.0" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
|
||||||
|
|
||||||
|
<!-- PE SCHEDULER → PE MATH -->
|
||||||
|
<polyline points="200.0,234.5 200.0,307.5 281.2,307.5" fill="none" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
|
||||||
|
|
||||||
|
<!-- PE DMA ↔ PE TCM -->
|
||||||
|
<line x1="368.7" y1="92.5" x2="456.2" y2="155.5" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
|
||||||
|
<text x="420.0" y="118.0" text-anchor="middle" font-family="monospace" font-size="6" fill="#64748b">DMA R/W</text>
|
||||||
|
|
||||||
|
<!-- PE GEMM → PE TCM -->
|
||||||
|
<line x1="368.7" y1="210.0" x2="396.2" y2="210.0" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
|
||||||
|
<text x="382.4" y="205.0" text-anchor="middle" font-family="monospace" font-size="6" fill="#64748b">512GB/s</text>
|
||||||
|
|
||||||
|
<!-- PE MATH → PE TCM -->
|
||||||
|
<polyline points="368.7,307.5 456.2,307.5 456.2,255.5" fill="none" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
|
||||||
|
<text x="412.4" y="301.0" text-anchor="middle" font-family="monospace" font-size="6" fill="#64748b">512GB/s</text>
|
||||||
|
|
||||||
|
<!-- PE_IPCQ → PE DMA (IpcqDmaToken, IPCQ port) — blue -->
|
||||||
|
<line x1="153.8" y1="82.0" x2="281.2" y2="82.0" stroke="#1565c0" stroke-width="1.5"/>
|
||||||
|
<polygon points="281.2,82.0 275.2,79.0 275.2,85.0" fill="#1565c0"/>
|
||||||
|
<text x="217.5" y="77.0" text-anchor="middle" font-family="monospace" font-size="6" fill="#1565c0">IpcqDmaToken</text>
|
||||||
|
|
||||||
|
<!-- PE DMA → PE_IPCQ (IpcqMetaArrival) — blue -->
|
||||||
|
<line x1="281.2" y1="102.0" x2="153.8" y2="102.0" stroke="#1565c0" stroke-width="1.5"/>
|
||||||
|
<polygon points="153.8,102.0 159.8,99.0 159.8,105.0" fill="#1565c0"/>
|
||||||
|
<text x="217.5" y="113.0" text-anchor="middle" font-family="monospace" font-size="6" fill="#1565c0">IpcqMetaArrival</text>
|
||||||
|
|
||||||
|
<!-- PE_IPCQ → PE DMA (IpcqCreditMetadata, dashed purple) -->
|
||||||
|
<line x1="153.8" y1="92.5" x2="281.2" y2="92.5" stroke="#7b1fa2" stroke-width="1" stroke-dasharray="4,3"/>
|
||||||
|
<text x="217.5" y="62.0" text-anchor="middle" font-family="monospace" font-size="6" fill="#7b1fa2">IpcqCreditMeta (dashed)</text>
|
||||||
|
|
||||||
|
<!-- ── Legend ── -->
|
||||||
|
<rect x="15" y="365" width="530" height="45" rx="4" fill="#f1f5f9" stroke="#cbd5e1" stroke-width="0.5"/>
|
||||||
|
<line x1="25" y1="385" x2="55" y2="385" stroke="#1565c0" stroke-width="1.5"/>
|
||||||
|
<text x="60" y="388" font-family="monospace" font-size="7" fill="#1e293b">IPCQ data path</text>
|
||||||
|
<line x1="140" y1="385" x2="170" y2="385" stroke="#7b1fa2" stroke-width="1" stroke-dasharray="4,3"/>
|
||||||
|
<text x="175" y="388" font-family="monospace" font-size="7" fill="#1e293b">IPCQ credit return</text>
|
||||||
|
<line x1="290" y1="385" x2="320" y2="385" stroke="#94a3b8" stroke-width="1.5"/>
|
||||||
|
<text x="325" y="388" font-family="monospace" font-size="7" fill="#1e293b">Compute data path</text>
|
||||||
|
<rect x="430" y="378" width="40" height="14" rx="2" fill="none" stroke="#0277bd" stroke-width="1" stroke-dasharray="4,2"/>
|
||||||
|
<text x="475" y="388" font-family="monospace" font-size="7" fill="#1e293b">IPCQ (new)</text>
|
||||||
|
</svg>
|
||||||
|
|||||||
|
Before Width: | Height: | Size: 3.4 KiB After Width: | Height: | Size: 6.6 KiB |
@@ -0,0 +1,237 @@
|
|||||||
|
# Hardware Architecture Overview
|
||||||
|
|
||||||
|
본 문서는 AI Accelerator 플랫폼의 하드웨어 아키텍처를 요약한다.
|
||||||
|
논문 분석 및 설계 검토 시 배경 지식으로 사용할 수 있다.
|
||||||
|
|
||||||
|
> Source ADRs: ADR-0003, ADR-0004, ADR-0014, ADR-0017, ADR-0022
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. System Hierarchy
|
||||||
|
|
||||||
|
시스템은 4단계 계층으로 구성된다.
|
||||||
|
|
||||||
|
```
|
||||||
|
Tray
|
||||||
|
├── Host CPU (runtime, data placement)
|
||||||
|
├── SIP 0 (accelerator)
|
||||||
|
│ ├── IO Chiplet (PCIe-EP, IO_CPU)
|
||||||
|
│ ├── CUBE 0
|
||||||
|
│ │ ├── PE 0 ─ PE 7
|
||||||
|
│ │ ├── HBM + HBM_CTRL
|
||||||
|
│ │ ├── Shared SRAM
|
||||||
|
│ │ ├── M_CPU (management)
|
||||||
|
│ │ ├── NOC 2D Mesh (router grid)
|
||||||
|
│ │ └── UCIe × 4 (N/S/E/W)
|
||||||
|
│ ├── CUBE 1 ... CUBE N
|
||||||
|
│ └── IO Chiplet(s)
|
||||||
|
├── SIP 1 ... SIP M
|
||||||
|
└── Interconnect (PCIe / UAL)
|
||||||
|
```
|
||||||
|
|
||||||
|
| Level | 구성 | 연결 |
|
||||||
|
|-------|------|------|
|
||||||
|
| **Tray** | Host CPU + 여러 SIP | PCIe / UAL fabric |
|
||||||
|
| **SIP** | 여러 CUBE + IO chiplet(s) | UCIe (cube간), PCIe-EP (host) |
|
||||||
|
| **CUBE** | 여러 PE + HBM + SRAM + M_CPU + NOC mesh | UCIe × 4 ports (N/S/E/W) |
|
||||||
|
| **PE** | PE_CPU + DMA + GEMM + MATH + TCM | NOC router 직결 |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. CUBE Architecture
|
||||||
|
|
||||||
|
각 CUBE는 독립적인 compute + memory unit이다.
|
||||||
|
|
||||||
|
### 2.1 Components
|
||||||
|
|
||||||
|
- **PEs**: 복수의 Processing Element, 각각 독립 커널 실행 가능
|
||||||
|
- **HBM + HBM_CTRL**: High Bandwidth Memory. 각 PE에 local HBM 영역이 할당되어 최소 latency로 접근
|
||||||
|
- **Shared SRAM**: Cube 내 모든 PE가 NOC를 통해 접근 가능한 공유 메모리
|
||||||
|
- **M_CPU**: Management CPU. 커널 command 분배 및 completion 집계
|
||||||
|
- **NOC (On-die Fabric)**: Cube 내 모든 컴포넌트를 연결하는 interconnect
|
||||||
|
- **UCIe × 4**: 각 방향(N/S/E/W)에 복수 connection, inter-cube 연결
|
||||||
|
|
||||||
|
### 2.2 NOC (On-die Fabric)
|
||||||
|
|
||||||
|
NOC는 cube 내 PE, HBM, SRAM, M_CPU, UCIe를 연결하는 on-die interconnect이다.
|
||||||
|
|
||||||
|
**아키텍처 요구사항** (topology 무관):
|
||||||
|
- 모든 PE가 local HBM에 full bandwidth로 접근 가능
|
||||||
|
- 모든 PE가 shared SRAM에 접근 가능
|
||||||
|
- 모든 PE가 UCIe를 통해 다른 cube에 접근 가능
|
||||||
|
- M_CPU가 모든 PE에 command를 전달 가능
|
||||||
|
- Per-link contention 모델링 지원
|
||||||
|
|
||||||
|
**현재 시뮬레이터 구현** (변경 가능):
|
||||||
|
- 2D mesh router grid (6×6 기본, XY deterministic routing)
|
||||||
|
- HBM_CTRL가 각 PE의 local router에 직결 (0 mesh hop)
|
||||||
|
- 중앙 HBM zone에는 router 배치 제외
|
||||||
|
- Contention: directed segment당 capacity=1 resource
|
||||||
|
|
||||||
|
NOC topology는 2D mesh 외에 ring, crossbar, hierarchical 등 다른 구현도 가능하며,
|
||||||
|
아키텍처 요구사항을 만족하는 한 교체 가능하다.
|
||||||
|
|
||||||
|
### 2.3 주요 Data Path
|
||||||
|
|
||||||
|
| Path | Route | 특성 |
|
||||||
|
|------|-------|------|
|
||||||
|
| PE → Local HBM | PE_DMA → NOC → HBM_CTRL | 최소 hop, 256 GB/s (×0.8 eff) |
|
||||||
|
| PE → Remote PE's HBM | PE_DMA → NOC hops → HBM_CTRL | NOC BW/hop에 제한 |
|
||||||
|
| PE → Shared SRAM | PE_DMA → NOC → SRAM | SRAM link BW에 제한 |
|
||||||
|
| PE → Other CUBE's HBM | PE_DMA → NOC → UCIe → NOC → HBM_CTRL | UCIe overhead 16ns (TX+RX) |
|
||||||
|
| Kernel Launch | IO → UCIe → M_CPU → NOC → PE_CPU | Command path |
|
||||||
|
|
||||||
|
### 2.4 Key Bandwidths
|
||||||
|
|
||||||
|
| Connection | Bandwidth | Notes |
|
||||||
|
|------------|-----------|-------|
|
||||||
|
| PE_DMA ↔ NOC | 256 GB/s | HBM slice BW 매칭 |
|
||||||
|
| NOC ↔ HBM_CTRL | 256 GB/s | Per PE, local 접근 |
|
||||||
|
| NOC ↔ SRAM | 128 GB/s × 4 | 512 GB/s aggregate |
|
||||||
|
| NOC ↔ UCIe conn | 128 GB/s × 4 | 512 GB/s per port |
|
||||||
|
| UCIe link (inter-cube) | 512 GB/s | 1.0mm seam distance |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. PE Architecture
|
||||||
|
|
||||||
|
각 PE는 하나의 커널 인스턴스를 실행하는 독립적인 프로세서이다.
|
||||||
|
|
||||||
|
### 3.1 Internal Components
|
||||||
|
|
||||||
|
```
|
||||||
|
PE_CPU (control)
|
||||||
|
│
|
||||||
|
├──→ PE_SCHED (dispatch)
|
||||||
|
│ │
|
||||||
|
│ ├──→ PE_DMA ←→ NOC Router ←→ HBM / SRAM / UCIe
|
||||||
|
│ │ ↕
|
||||||
|
│ ├──→ PE_FETCH_STORE ←→ PE_TCM (16MB SRAM)
|
||||||
|
│ │
|
||||||
|
│ ├──→ PE_GEMM (matrix multiply)
|
||||||
|
│ └──→ PE_MATH (elementwise)
|
||||||
|
│
|
||||||
|
└──→ PE_IPCQ (collective communication)
|
||||||
|
│
|
||||||
|
└──→ PE_DMA (IPCQ port)
|
||||||
|
```
|
||||||
|
|
||||||
|
| Component | 역할 |
|
||||||
|
|-----------|------|
|
||||||
|
| **PE_CPU** | 커널 instruction stream 실행, command 생성 |
|
||||||
|
| **PE_SCHED** | Command dispatcher. Composite command를 tile pipeline으로 분해 |
|
||||||
|
| **PE_DMA** | HBM ↔ TCM 데이터 전송 (NOC router mesh 경유). Read/Write 각 1 channel |
|
||||||
|
| **PE_GEMM** | 행렬 곱 엔진. TCM에서 activation 읽기, HBM에서 weight streaming 가능 |
|
||||||
|
| **PE_MATH** | Element-wise 연산 엔진. TCM 읽기/쓰기 |
|
||||||
|
| **PE_TCM** | 16MB on-PE SRAM. Compute의 staging memory |
|
||||||
|
| **PE_IPCQ** | PE간 collective communication 제어 (ring buffer pointer 관리) |
|
||||||
|
|
||||||
|
### 3.2 Compute Pipeline (Tiled Execution)
|
||||||
|
|
||||||
|
Composite command는 tile 단위로 pipeline 실행된다:
|
||||||
|
|
||||||
|
```
|
||||||
|
DMA_READ(t) → COMPUTE(t) → DMA_WRITE(t)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Overlap 규칙**:
|
||||||
|
- 허용: `DMA_READ(t+1) ∥ COMPUTE(t)`, `DMA_WRITE(t-1) ∥ COMPUTE(t)`
|
||||||
|
- 금지: `GEMM(t) ∥ GEMM(t')`, `GEMM(t) ∥ MATH(t')`
|
||||||
|
|
||||||
|
**DMA Engine**: Read/Write 각각 capacity=1. 동시 Read+Write 가능, 동시 Read+Read 불가.
|
||||||
|
|
||||||
|
**Compute Engine**: GEMM과 MATH가 단일 compute slot 공유. 한 번에 하나만 실행.
|
||||||
|
|
||||||
|
### 3.3 TCM-centric Dataflow
|
||||||
|
|
||||||
|
모든 compute는 TCM을 중심으로 동작한다:
|
||||||
|
|
||||||
|
```
|
||||||
|
Input: HBM → (NOC) → PE_DMA → PE_TCM
|
||||||
|
Compute: PE_TCM → GEMM / MATH → PE_TCM
|
||||||
|
Output: PE_TCM → PE_DMA → (NOC) → HBM
|
||||||
|
```
|
||||||
|
|
||||||
|
PE_TCM은 두 영역으로 분할된다:
|
||||||
|
- **SchedulerReservedTCM**: PE_SCHED 전용 tile buffer 영역 (DMA/compute staging)
|
||||||
|
- **AllocatableTCM**: 범용 할당 영역 (host/DP-visible)
|
||||||
|
|
||||||
|
두 영역은 hard isolation으로 분리된다.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Memory Hierarchy
|
||||||
|
|
||||||
|
### 4.1 Memory Tiers
|
||||||
|
|
||||||
|
| Memory | Scope | Capacity | Bandwidth | Latency | 접근 경로 |
|
||||||
|
|--------|-------|----------|-----------|---------|-----------|
|
||||||
|
| **PE_TCM** | PE 전용 | 16 MB | 512 GB/s | 최저 | 직결 (NOC 미경유) |
|
||||||
|
| **Shared SRAM** | Cube 공유 | 32 MB | 128 GB/s (NoC link) | 중간 | PE → NOC → SRAM |
|
||||||
|
| **Local HBM** | PE별 할당 | Large | 256 GB/s (×0.8 eff) | 높음 | PE → local router → HBM_CTRL |
|
||||||
|
| **Remote HBM** | 다른 PE/Cube | Large | Mesh/UCIe BW 제한 | 최고 | PE → NOC mesh → (UCIe) → HBM_CTRL |
|
||||||
|
|
||||||
|
### 4.2 Local HBM Bandwidth Guarantee
|
||||||
|
|
||||||
|
- 각 PE는 자신의 local router에 직결된 HBM pseudo-channel을 가진다
|
||||||
|
- Local HBM 접근은 **0 mesh hop** (switching overhead만)
|
||||||
|
- Effective bandwidth = spec BW × efficiency factor (default 0.8)
|
||||||
|
- 예: 256 GB/s × 0.8 = 204.8 GB/s effective
|
||||||
|
- 이 보장은 fabric bandwidth와 무관하게 유지된다
|
||||||
|
|
||||||
|
### 4.3 Memory-Centric Design Principle
|
||||||
|
|
||||||
|
- **Compute는 data 근처에서 실행**: PE가 local HBM에 직결되어 데이터 이동 최소화
|
||||||
|
- **TCM은 compute의 scratchpad**: 모든 compute 입출력은 TCM을 경유
|
||||||
|
- **HBM은 primary storage**: 대용량 tensor 저장, DMA로 TCM에 tile 단위 load/store
|
||||||
|
- **Shared SRAM은 cube-level 공유**: 중간 결과 공유, reduction buffer 등
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. SPMD Execution Model
|
||||||
|
|
||||||
|
### 5.1 Program ID Mapping
|
||||||
|
|
||||||
|
커널은 2D hardware grid에서 SPMD 방식으로 실행된다:
|
||||||
|
|
||||||
|
| API | 반환 값 | 설명 |
|
||||||
|
|-----|---------|------|
|
||||||
|
| `tl.program_id(axis=0)` | `local_pe_id` | Cube 내 PE 인덱스 |
|
||||||
|
| `tl.program_id(axis=1)` | `cube_id` | Cube 인덱스 |
|
||||||
|
| `tl.num_programs(axis=0)` | `num_pes_per_cube` | Cube당 PE 수 |
|
||||||
|
| `tl.num_programs(axis=1)` | `num_cubes` | 전체 Cube 수 |
|
||||||
|
|
||||||
|
```python
|
||||||
|
global_pid = tl.program_id(axis=1) * tl.num_programs(axis=0) + tl.program_id(axis=0)
|
||||||
|
```
|
||||||
|
|
||||||
|
### 5.2 Axis Mapping Rationale
|
||||||
|
|
||||||
|
- **axis=0 = PE (innermost)**: Cube 내 PE는 HBM을 공유하고 local NOC로 통신. 빠르고 tightly-coupled. GPU의 thread-in-block에 대응.
|
||||||
|
- **axis=1 = Cube (outer)**: Cube 간 통신은 UCIe 경유로 latency 높음. Coarse scheduling 단위. GPU의 block-in-grid에 대응.
|
||||||
|
|
||||||
|
### 5.3 Kernel Execution Flow
|
||||||
|
|
||||||
|
```
|
||||||
|
Host CPU
|
||||||
|
→ IO_CPU (PCIe-EP)
|
||||||
|
→ M_CPU (management, per cube)
|
||||||
|
→ PE_CPU × N (broadcast)
|
||||||
|
→ Each PE executes same kernel with unique (pe_id, cube_id)
|
||||||
|
```
|
||||||
|
|
||||||
|
모든 PE가 동일 커널을 실행하되, `program_id`로 자신의 데이터 파티션을 식별하여
|
||||||
|
독립적으로 처리한다 (SPMD).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 6. Inter-PE Communication (IPCQ)
|
||||||
|
|
||||||
|
PE 간 collective communication은 IPCQ(Inter-PE Communication Queue)를 통해 수행된다.
|
||||||
|
|
||||||
|
- 각 PE는 방향별(N/S/E/W 등) ring buffer 기반 queue pair를 유지
|
||||||
|
- **DMA-IPCQ co-design**: DMA data flit에 head pointer를 piggyback하여 별도 제어 메시지 없이 pointer 동기화
|
||||||
|
- **Credit-based flow control**: Receiver가 slot 소비 후 16B credit으로 sender에게 알림
|
||||||
|
- IPCQ slot buffer는 **TCM, Shared SRAM, Local HBM** 중 선택 가능
|
||||||
|
|
||||||
|
자세한 내용은 `docs/ipcq-dma-codesign-hw.md` 및 ADR-0023 참조.
|
||||||