4 Commits

Author SHA1 Message Date
ywkang a44f832be5 Regenerate latency plots/diagrams for post-Phase-2c model
Allreduce + pe2pe + ipcq + pe_view auto-regenerated by test sweeps
running against the new chunk-streaming wire timing (per-flit
wormhole) — absolute numbers shift upward to reflect bottleneck-link
transit charged once per flit (instead of the previous cut-through
subtraction at HBM CTRL).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 23:24:01 -07:00
ywkang a0cccc71e8 Add HW architecture overview (Korean)
Standalone summary of the modeled hardware hierarchy and components.
Cross-references ADR-0003, 0004, 0014, 0017, 0022.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 23:23:52 -07:00
ywkang 32b29a1e5c ADR-0003/0014: generalize "router mesh" to "NOC"
NOC topology is an implementation choice (mesh, ring, crossbar, etc.).
ADR-0017 covers the current 2D mesh choice; ADRs at the system-level
shouldn't bind to that specific implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 23:23:46 -07:00
ywkang c9bd5387ac ADR-0033 D6: reorder future work by workload impact
Cycle-accurate arbitration policies (priority/iSLIP) downgraded to
"academic / specific use cases" — FIFO inbox is approximately fair
for typical similar-rate workloads (GEMM, AllReduce, data parallel).
True impact appears only for QoS modeling or per-stream tail latency
analysis under saturation.

Higher-priority items pulled forward: address-based PC selection at
HBM CTRL (directly affects multi-PE concurrent HBM contention), bank
conflict modeling, HBM scheduler, finite buffer backpressure, op_log
chunk-streaming integration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 23:21:35 -07:00
21 changed files with 516 additions and 198 deletions
+7 -5
View File
@@ -35,11 +35,13 @@ We model the system hierarchy explicitly:
- A CUBE contains: - A CUBE contains:
- HBM + memory controller (HBM_CTRL) - HBM + memory controller (HBM_CTRL)
- NOC router mesh: 2D grid of explicit routers (from cube_mesh.yaml) with XY routing; - NOC (on-die fabric): carries all intra-cube traffic including HBM data,
carries all intra-cube traffic including HBM data, inter-cube (UCIe), inter-cube (UCIe), command (M_CPU↔PE_CPU), and shared SRAM access.
command (M_CPU↔PE_CPU), and shared SRAM access. Must provide: full-BW PE↔local HBM path, PE↔SRAM connectivity,
HBM_CTRL is attached to PE routers (local HBM = 0 hop). PE↔UCIe connectivity, M_CPU↔PE command path.
See ADR-0017 and ADR-0019 for full architecture. NOC topology is an implementation choice (e.g., 2D mesh, ring, crossbar);
current implementation uses a 2D mesh with XY routing (see ADR-0017).
HBM_CTRL is attached to each PE's local NOC port (local HBM = minimal hop).
- Shared SRAM: cube-level shared memory accessible by all PEs via NOC - Shared SRAM: cube-level shared memory accessible by all PEs via NOC
- management/control CPU (M_CPU) coordinating PE command distribution and completion aggregation - management/control CPU (M_CPU) coordinating PE command distribution and completion aggregation
- multiple PEs - multiple PEs
@@ -44,15 +44,15 @@ Each PE contains the following logical components.
**PE_DMA** **PE_DMA**
- Handles memory transfers between PE_TCM and external memory domains. - Handles memory transfers between PE_TCM and external memory domains.
- PE_DMA connects to the NOC router mesh at the CUBE level (ADR-0019): - PE_DMA connects to the cube-level NOC (on-die fabric):
- All destinations (HBM, shared SRAM, inter-cube UCIe) are reached via the router mesh - All destinations (HBM, shared SRAM, inter-cube UCIe) are reached via the NOC
- Local HBM access: PE_DMA → local router → hbm_ctrl (switching overhead only) - Local HBM access: PE_DMA → NOC → hbm_ctrl (minimal hop)
- Remote/shared: PE_DMA → local router → (mesh hops) → destination - Remote/shared: PE_DMA → NOC → (fabric hops) → destination
- Supported directions include: - Supported directions include:
- HBM → PE_TCM (via router mesh) - HBM → PE_TCM (via NOC)
- PE_TCM → HBM (via router mesh) - PE_TCM → HBM (via NOC)
- PE_TCM → shared SRAM (via router mesh) - PE_TCM → shared SRAM (via NOC)
- PE_TCM → other memory domains (via router mesh, if supported by topology) - PE_TCM → other memory domains (via NOC, if supported by topology)
**PE_GEMM** **PE_GEMM**
@@ -252,7 +252,7 @@ Compute operations use a TCM-centric dataflow model.
**Input path (HBM)** **Input path (HBM)**
```text ```text
HBM → router mesh → PE_DMA (DMA_READ) → PE_TCM HBM → NOC → PE_DMA (DMA_READ) → PE_TCM
``` ```
**Input path (shared SRAM)** **Input path (shared SRAM)**
@@ -269,14 +269,14 @@ Compute engines read input tensors from PE_TCM.
PE_TCM → GEMM / MATH PE_TCM → GEMM / MATH
``` ```
Weights for GEMM may optionally stream directly from HBM (via router mesh). Weights for GEMM may optionally stream directly from HBM (via NOC).
**Output path (HBM)** **Output path (HBM)**
Compute results are written to PE_TCM, then DMA writes to HBM. Compute results are written to PE_TCM, then DMA writes to HBM.
```text ```text
PE_TCM → PE_DMA (DMA_WRITE) → router mesh → HBM PE_TCM → PE_DMA (DMA_WRITE) → NOC → HBM
``` ```
**Output path (shared SRAM)** **Output path (shared SRAM)**
@@ -348,9 +348,9 @@ PE instances are derived from `cube.pe_layout`.
External connectivity such as: External connectivity such as:
- PE_DMA → router mesh → HBM (data path, ADR-0019) - PE_DMA → NOC → HBM (data path)
- PE_DMA → router mesh → shared SRAM, inter-cube UCIe (non-HBM data path) - PE_DMA → NOC → shared SRAM, inter-cube UCIe (non-HBM data path)
- router mesh → PE_CPU (command path from M_CPU) - NOC → PE_CPU (command path from M_CPU)
is modeled at the CUBE level (see ADR-0003 D3). is modeled at the CUBE level (see ADR-0003 D3).
+27 -16
View File
@@ -106,36 +106,47 @@ Note: multi-stream merging at routers IS modeled correctly — each
in_port has its own fan_in process, all push to a shared inbox, and in_port has its own fan_in process, all push to a shared inbox, and
the router worker forwards in inbox FIFO order. Flits from different the router worker forwards in inbox FIFO order. Flits from different
upstream streams naturally interleave at flit granularity. The items upstream streams naturally interleave at flit granularity. The items
below are different concerns. below are different concerns, ordered by expected workload impact.
**Higher impact (workload accuracy gap)**:
- [ ] **Cycle-accurate router arbitration policies** (RR with
priorities, age, iSLIP). Currently the inbox FIFO order is used as
a proxy for fair RR — works when flit arrival times differ slightly
between streams, but doesn't reflect intentional priority/QoS.
- [ ] **Sub-flit (32B) granularity** for finer wire arbitration
cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
per 32B flit. Effect is small for most workloads (sub-flit timing
noise).
- [ ] **Address-based PC selection at HBM CTRL** (replace the - [ ] **Address-based PC selection at HBM CTRL** (replace the
address-blind global round-robin). When two transactions of size address-blind global round-robin). When two transactions of size
`num_pcs × burst_bytes` (e.g., 2KB at 8 PCs × 256B) arrive `num_pcs × burst_bytes` (e.g., 2KB at 8 PCs × 256B) arrive
concurrently, both claim PCs 0..7 via global RR, producing full concurrently, both claim PCs 0..7 via global RR, producing full
per-PC contention. Real HW uses address bits to select PCs, so per-PC contention even when real-HW address striping would put
different-address transactions hit different PC patterns. Address them on disjoint PC sets. Directly affects multi-PE concurrent
modeling would let the simulator reflect cache-line/page-aware HBM workload latencies.
layouts.
- [ ] **Bank-level conflict modeling** within a PC (opt-in via - [ ] **Bank-level conflict modeling** within a PC (opt-in via
`track_banks: true`). Currently we assume no same-bank reuse. `track_banks: true`). Currently we assume no same-bank reuse;
random scatter/gather workloads are optimistic here.
- [ ] **HBM scheduler** with write buffer + watermark drain (Tier 2 - [ ] **HBM scheduler** with write buffer + watermark drain (Tier 2
from the design discussion). Default `switch_penalty_ns=0` is the from the design discussion). Default `switch_penalty_ns=0` is the
ideal-amortization stand-in. ideal-amortization stand-in; bursty mixed R/W workloads benefit
- [ ] **Backpressure** modeling for finite component buffers. from explicit modeling.
- [ ] **Backpressure** modeling for finite component buffers. Matters
at high concurrency / sustained saturation where buffer occupancy
causes upstream stalls.
- [ ] **Op_log integration with chunk-streaming**: currently op_log - [ ] **Op_log integration with chunk-streaming**: currently op_log
fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd, fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd,
GemmCmd, MathCmd) which are not chunkified. Integration would GemmCmd, MathCmd) which are not chunkified. Integration would
require flit-aware components to also emit op_log start/end hooks require flit-aware components to also emit op_log start/end hooks
per transaction (start on first flit, end on is_last). per transaction (start on first flit, end on is_last).
**Lower impact (academic / specific use cases)**:
- [ ] **Cycle-accurate router arbitration policies** (RR with
priorities, age, iSLIP). The FIFO inbox is already approximately
fair when flit arrival times differ slightly between streams (the
common case for similar-rate workloads). True impact appears only
for: (a) priority/QoS modeling, (b) per-stream tail latency
analysis under sustained saturation. Not critical for makespan or
average-latency studies.
- [ ] **Sub-flit (32B) granularity** for finer wire arbitration
cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
per 32B flit. Effect is small for most workloads (sub-flit timing
noise on small messages).
## Consequences ## Consequences
- Single review point for all model fidelity questions. Each future PR - Single review point for all model fidelity questions. Each future PR
@@ -1,13 +1,13 @@
buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
hbm,torus_2d,6,128,256,1858.0399999999827 hbm,torus_2d,6,128,256,2144.0399999999754
hbm,torus_2d,6,1024,2048,2389.0399999999827 hbm,torus_2d,6,1024,2048,2908.74499999995
hbm,torus_2d,6,8192,16384,6673.039999999986 hbm,torus_2d,6,8192,16384,8851.185000000081
hbm,torus_2d,6,32768,65536,21361.03999999992 hbm,torus_2d,6,32768,65536,29225.265000008752
sram,torus_2d,6,128,256,1774.0399999999827 sram,torus_2d,6,128,256,2060.0399999999754
sram,torus_2d,6,1024,2048,2389.0399999999827 sram,torus_2d,6,1024,2048,2908.74499999995
sram,torus_2d,6,8192,16384,7345.039999999986 sram,torus_2d,6,8192,16384,9523.185000000081
sram,torus_2d,6,32768,65536,24337.039999999935 sram,torus_2d,6,32768,65536,32201.265000008752
tcm,torus_2d,6,128,256,1678.0399999999827 tcm,torus_2d,6,128,256,1964.0399999999754
tcm,torus_2d,6,1024,2048,1957.0399999999827 tcm,torus_2d,6,1024,2048,2476.74499999995
tcm,torus_2d,6,8192,16384,4225.039999999986 tcm,torus_2d,6,8192,16384,6403.185000000081
tcm,torus_2d,6,32768,65536,12001.03999999992 tcm,torus_2d,6,32768,65536,19865.265000008738
1 buffer_kind sip_topology n_sips n_elem bytes_per_pe latency_ns
2 hbm torus_2d 6 128 256 1858.0399999999827 2144.0399999999754
3 hbm torus_2d 6 1024 2048 2389.0399999999827 2908.74499999995
4 hbm torus_2d 6 8192 16384 6673.039999999986 8851.185000000081
5 hbm torus_2d 6 32768 65536 21361.03999999992 29225.265000008752
6 sram torus_2d 6 128 256 1774.0399999999827 2060.0399999999754
7 sram torus_2d 6 1024 2048 2389.0399999999827 2908.74499999995
8 sram torus_2d 6 8192 16384 7345.039999999986 9523.185000000081
9 sram torus_2d 6 32768 65536 24337.039999999935 32201.265000008752
10 tcm torus_2d 6 128 256 1678.0399999999827 1964.0399999999754
11 tcm torus_2d 6 1024 2048 1957.0399999999827 2476.74499999995
12 tcm torus_2d 6 8192 16384 4225.039999999986 6403.185000000081
13 tcm torus_2d 6 32768 65536 12001.03999999992 19865.265000008738
Binary file not shown.

Before

Width:  |  Height:  |  Size: 74 KiB

After

Width:  |  Height:  |  Size: 76 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 40 KiB

After

Width:  |  Height:  |  Size: 39 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 82 KiB

After

Width:  |  Height:  |  Size: 79 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 38 KiB

After

Width:  |  Height:  |  Size: 37 KiB

@@ -1,37 +1,37 @@
algorithm,sip_topology,n_sips,n_elem,bytes_per_pe,bytes_per_sip,latency_ns algorithm,sip_topology,n_sips,n_elem,bytes_per_pe,bytes_per_sip,latency_ns
intercube_allreduce,mesh_2d_no_wrap,6,8,16,256,2626.302499999998 intercube_allreduce,mesh_2d_no_wrap,6,8,16,256,2666.5524999999725
intercube_allreduce,mesh_2d_no_wrap,6,32,64,1024,2634.7399999999952 intercube_allreduce,mesh_2d_no_wrap,6,32,64,1024,2747.7399999999725
intercube_allreduce,mesh_2d_no_wrap,6,64,128,2048,2645.9899999999925 intercube_allreduce,mesh_2d_no_wrap,6,64,128,2048,2855.98999999998
intercube_allreduce,mesh_2d_no_wrap,6,128,256,4096,2668.489999999987 intercube_allreduce,mesh_2d_no_wrap,6,128,256,4096,3072.4899999999725
intercube_allreduce,mesh_2d_no_wrap,6,512,1024,16384,2812.489999999987 intercube_allreduce,mesh_2d_no_wrap,6,512,1024,16384,3336.579999999951
intercube_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3010.489999999987 intercube_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3707.49999999992
intercube_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,3406.489999999987 intercube_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,4449.339999999875
intercube_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,4198.489999999965 intercube_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,5933.020000000055
intercube_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,5782.489999999969 intercube_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,8900.380000000157
intercube_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,8950.489999999925 intercube_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,14835.099999997583
intercube_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,15286.48999999986 intercube_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,26704.540000017492
intercube_allreduce,mesh_2d_no_wrap,6,49152,98304,1572864,21622.489999999932 intercube_allreduce,mesh_2d_no_wrap,6,49152,98304,1572864,38573.980000026335
intercube_allreduce,ring_1d,6,8,16,256,2302.9849999999933 intercube_allreduce,ring_1d,6,8,16,256,2365.2558333333036
intercube_allreduce,ring_1d,6,32,64,1024,2310.8599999999906 intercube_allreduce,ring_1d,6,32,64,1024,2436.9433333333036
intercube_allreduce,ring_1d,6,64,128,2048,2321.359999999988 intercube_allreduce,ring_1d,6,64,128,2048,2532.526666666643
intercube_allreduce,ring_1d,6,128,256,4096,2342.3599999999824 intercube_allreduce,ring_1d,6,128,256,4096,2723.6933333333036
intercube_allreduce,ring_1d,6,512,1024,16384,2479.3599999999824 intercube_allreduce,ring_1d,6,512,1024,16384,3042.0349999999544
intercube_allreduce,ring_1d,6,1024,2048,32768,2669.3599999999824 intercube_allreduce,ring_1d,6,1024,2048,32768,3390.201666666597
intercube_allreduce,ring_1d,6,2048,4096,65536,3049.3599999999824 intercube_allreduce,ring_1d,6,2048,4096,65536,4079.7349999998714
intercube_allreduce,ring_1d,6,4096,8192,131072,3809.3599999999715 intercube_allreduce,ring_1d,6,4096,8192,131072,5458.801666666721
intercube_allreduce,ring_1d,6,8192,16384,262144,5329.359999999979 intercube_allreduce,ring_1d,6,8192,16384,262144,8216.93500000014
intercube_allreduce,ring_1d,6,16384,32768,524288,8369.35999999992 intercube_allreduce,ring_1d,6,16384,32768,524288,13733.201666664638
intercube_allreduce,ring_1d,6,32768,65536,1048576,14449.359999999899 intercube_allreduce,ring_1d,6,32768,65536,1048576,24765.735000014545
intercube_allreduce,ring_1d,6,49152,98304,1572864,20529.35999999997 intercube_allreduce,ring_1d,6,49152,98304,1572864,35798.268333355256
intercube_allreduce,torus_2d,6,8,16,256,1644.2899999999936 intercube_allreduce,torus_2d,6,8,16,256,1700.6024999999754
intercube_allreduce,torus_2d,6,32,64,1024,1651.0399999999909 intercube_allreduce,torus_2d,6,32,64,1024,1753.2899999999754
intercube_allreduce,torus_2d,6,64,128,2048,1660.0399999999881 intercube_allreduce,torus_2d,6,64,128,2048,1823.539999999979
intercube_allreduce,torus_2d,6,128,256,4096,1678.0399999999827 intercube_allreduce,torus_2d,6,128,256,4096,1964.0399999999754
intercube_allreduce,torus_2d,6,512,1024,16384,1795.0399999999827 intercube_allreduce,torus_2d,6,512,1024,16384,2196.2849999999653
intercube_allreduce,torus_2d,6,1024,2048,32768,1957.0399999999827 intercube_allreduce,torus_2d,6,1024,2048,32768,2476.74499999995
intercube_allreduce,torus_2d,6,2048,4096,65536,2281.0399999999827 intercube_allreduce,torus_2d,6,2048,4096,65536,3037.664999999919
intercube_allreduce,torus_2d,6,4096,8192,131072,2929.039999999979 intercube_allreduce,torus_2d,6,4096,8192,131072,4159.50500000003
intercube_allreduce,torus_2d,6,8192,16384,262144,4225.039999999986 intercube_allreduce,torus_2d,6,8192,16384,262144,6403.185000000081
intercube_allreduce,torus_2d,6,16384,32768,524288,6817.039999999943 intercube_allreduce,torus_2d,6,16384,32768,524288,10890.544999998769
intercube_allreduce,torus_2d,6,32768,65536,1048576,12001.03999999992 intercube_allreduce,torus_2d,6,32768,65536,1048576,19865.265000008738
intercube_allreduce,torus_2d,6,49152,98304,1572864,17185.039999999994 intercube_allreduce,torus_2d,6,49152,98304,1572864,28839.985000013185
1 algorithm sip_topology n_sips n_elem bytes_per_pe bytes_per_sip latency_ns
2 intercube_allreduce mesh_2d_no_wrap 6 8 16 256 2626.302499999998 2666.5524999999725
3 intercube_allreduce mesh_2d_no_wrap 6 32 64 1024 2634.7399999999952 2747.7399999999725
4 intercube_allreduce mesh_2d_no_wrap 6 64 128 2048 2645.9899999999925 2855.98999999998
5 intercube_allreduce mesh_2d_no_wrap 6 128 256 4096 2668.489999999987 3072.4899999999725
6 intercube_allreduce mesh_2d_no_wrap 6 512 1024 16384 2812.489999999987 3336.579999999951
7 intercube_allreduce mesh_2d_no_wrap 6 1024 2048 32768 3010.489999999987 3707.49999999992
8 intercube_allreduce mesh_2d_no_wrap 6 2048 4096 65536 3406.489999999987 4449.339999999875
9 intercube_allreduce mesh_2d_no_wrap 6 4096 8192 131072 4198.489999999965 5933.020000000055
10 intercube_allreduce mesh_2d_no_wrap 6 8192 16384 262144 5782.489999999969 8900.380000000157
11 intercube_allreduce mesh_2d_no_wrap 6 16384 32768 524288 8950.489999999925 14835.099999997583
12 intercube_allreduce mesh_2d_no_wrap 6 32768 65536 1048576 15286.48999999986 26704.540000017492
13 intercube_allreduce mesh_2d_no_wrap 6 49152 98304 1572864 21622.489999999932 38573.980000026335
14 intercube_allreduce ring_1d 6 8 16 256 2302.9849999999933 2365.2558333333036
15 intercube_allreduce ring_1d 6 32 64 1024 2310.8599999999906 2436.9433333333036
16 intercube_allreduce ring_1d 6 64 128 2048 2321.359999999988 2532.526666666643
17 intercube_allreduce ring_1d 6 128 256 4096 2342.3599999999824 2723.6933333333036
18 intercube_allreduce ring_1d 6 512 1024 16384 2479.3599999999824 3042.0349999999544
19 intercube_allreduce ring_1d 6 1024 2048 32768 2669.3599999999824 3390.201666666597
20 intercube_allreduce ring_1d 6 2048 4096 65536 3049.3599999999824 4079.7349999998714
21 intercube_allreduce ring_1d 6 4096 8192 131072 3809.3599999999715 5458.801666666721
22 intercube_allreduce ring_1d 6 8192 16384 262144 5329.359999999979 8216.93500000014
23 intercube_allreduce ring_1d 6 16384 32768 524288 8369.35999999992 13733.201666664638
24 intercube_allreduce ring_1d 6 32768 65536 1048576 14449.359999999899 24765.735000014545
25 intercube_allreduce ring_1d 6 49152 98304 1572864 20529.35999999997 35798.268333355256
26 intercube_allreduce torus_2d 6 8 16 256 1644.2899999999936 1700.6024999999754
27 intercube_allreduce torus_2d 6 32 64 1024 1651.0399999999909 1753.2899999999754
28 intercube_allreduce torus_2d 6 64 128 2048 1660.0399999999881 1823.539999999979
29 intercube_allreduce torus_2d 6 128 256 4096 1678.0399999999827 1964.0399999999754
30 intercube_allreduce torus_2d 6 512 1024 16384 1795.0399999999827 2196.2849999999653
31 intercube_allreduce torus_2d 6 1024 2048 32768 1957.0399999999827 2476.74499999995
32 intercube_allreduce torus_2d 6 2048 4096 65536 2281.0399999999827 3037.664999999919
33 intercube_allreduce torus_2d 6 4096 8192 131072 2929.039999999979 4159.50500000003
34 intercube_allreduce torus_2d 6 8192 16384 262144 4225.039999999986 6403.185000000081
35 intercube_allreduce torus_2d 6 16384 32768 524288 6817.039999999943 10890.544999998769
36 intercube_allreduce torus_2d 6 32768 65536 1048576 12001.03999999992 19865.265000008738
37 intercube_allreduce torus_2d 6 49152 98304 1572864 17185.039999999994 28839.985000013185
Binary file not shown.

Before

Width:  |  Height:  |  Size: 38 KiB

After

Width:  |  Height:  |  Size: 36 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 233 KiB

After

Width:  |  Height:  |  Size: 233 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 166 KiB

After

Width:  |  Height:  |  Size: 165 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 50 KiB

After

Width:  |  Height:  |  Size: 47 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 49 KiB

After

Width:  |  Height:  |  Size: 47 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 54 KiB

After

Width:  |  Height:  |  Size: 51 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 53 KiB

After

Width:  |  Height:  |  Size: 43 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 109 KiB

After

Width:  |  Height:  |  Size: 111 KiB

+80 -80
View File
@@ -1,81 +1,81 @@
hop,label,size_bytes,path,total_ns hop,label,size_bytes,path,total_ns
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,ipcq,31.3899999999976 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,ipcq,42.8899999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,raw,12.019999999996799 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,raw,29.0199999999968
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,ipcq,33.1399999999976 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,ipcq,48.1399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,raw,13.019999999996799 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,raw,31.0199999999968
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,ipcq,34.8899999999976 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,ipcq,50.3899999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,raw,14.019999999996799 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,raw,32.0199999999968
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,ipcq,36.6399999999976 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,ipcq,52.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,raw,15.019999999996799 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,raw,33.0199999999968
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,ipcq,40.1399999999976 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,ipcq,57.1399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,raw,17.0199999999968 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,raw,35.0199999999968
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,ipcq,43.6399999999976 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,ipcq,62.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,raw,19.0199999999968 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,raw,37.0199999999968
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,ipcq,57.6399999999976 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,ipcq,84.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,raw,27.0199999999968 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,raw,45.0199999999968
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,ipcq,85.6399999999976 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,ipcq,128.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,raw,43.0199999999968 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,raw,61.0199999999968
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,ipcq,141.64000000000306 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,ipcq,216.64000000000306
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,raw,75.02000000000407 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,raw,93.02000000000407
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,ipcq,169.64000000000306 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,ipcq,260.64000000000306
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,raw,91.02000000000407 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,raw,109.02000000000407
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,ipcq,31.3899999999976 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,ipcq,42.8899999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,raw,12.019999999996799 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,raw,29.0199999999968
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,ipcq,33.1399999999976 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,ipcq,48.1399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,raw,13.019999999996799 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,raw,31.0199999999968
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,ipcq,34.8899999999976 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,ipcq,50.3899999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,raw,14.019999999996799 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,raw,32.0199999999968
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,ipcq,36.6399999999976 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,ipcq,52.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,raw,15.019999999996799 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,raw,33.0199999999968
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,ipcq,40.1399999999976 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,ipcq,57.1399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,raw,17.0199999999968 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,raw,35.0199999999968
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,ipcq,43.6399999999976 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,ipcq,62.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,raw,19.0199999999968 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,raw,37.0199999999968
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,ipcq,57.6399999999976 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,ipcq,84.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,raw,27.0199999999968 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,raw,45.0199999999968
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,ipcq,85.6399999999976 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,ipcq,128.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,raw,43.0199999999968 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,raw,61.0199999999968
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,ipcq,141.64000000000306 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,ipcq,216.64000000000306
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,raw,75.02000000000407 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,raw,93.02000000000407
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,ipcq,169.64000000000306 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,ipcq,260.64000000000306
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,raw,91.02000000000407 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,raw,109.02000000000407
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,ipcq,67.40999999999804 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,ipcq,81.15999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,raw,68.53999999999724 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,raw,89.28999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,ipcq,69.15999999999804 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,ipcq,88.65999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,raw,70.03999999999724 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,raw,95.53999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,ipcq,70.90999999999804 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,ipcq,90.90999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,raw,71.53999999999724 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,raw,96.53999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,ipcq,72.65999999999804 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,ipcq,93.15999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,raw,73.03999999999724 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,raw,97.53999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,ipcq,76.15999999999804 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,ipcq,97.65999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,raw,76.03999999999724 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,raw,99.53999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,ipcq,79.65999999999804 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,ipcq,103.15999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,raw,79.03999999999724 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,raw,102.53999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,ipcq,93.65999999999804 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,ipcq,125.15999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,raw,91.03999999999724 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,raw,114.53999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,ipcq,121.65999999999804 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,ipcq,169.15999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,raw,115.03999999999724 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,raw,138.53999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,ipcq,177.65999999999985 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,ipcq,257.15999999999985
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,raw,163.04000000000087 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,raw,186.54000000000087
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,ipcq,205.65999999999985 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,ipcq,301.15999999999985
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,raw,187.04000000000087 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,raw,210.54000000000087
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,ipcq,87.40999999999804 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,ipcq,103.15999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,raw,88.53999999999724 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,raw,111.28999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,ipcq,89.15999999999804 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,ipcq,112.65999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,raw,90.03999999999724 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,raw,119.53999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,ipcq,90.90999999999804 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,ipcq,114.90999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,raw,91.53999999999724 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,raw,120.53999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,ipcq,92.65999999999804 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,ipcq,117.15999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,raw,93.03999999999724 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,raw,121.53999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,ipcq,96.15999999999804 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,ipcq,121.65999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,raw,96.03999999999724 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,raw,123.53999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,ipcq,99.65999999999804 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,ipcq,127.15999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,raw,99.03999999999724 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,raw,126.53999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,ipcq,113.65999999999804 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,ipcq,149.15999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,raw,111.03999999999724 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,raw,138.53999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,ipcq,141.65999999999804 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,ipcq,193.15999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,raw,135.03999999999724 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,raw,162.53999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,ipcq,197.65999999999985 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,ipcq,281.15999999999985
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,raw,183.04000000000087 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,raw,210.54000000000087
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,ipcq,225.65999999999985 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,ipcq,325.15999999999985
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,raw,207.04000000000087 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,raw,234.54000000000087
1 hop label size_bytes path total_ns
2 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 128 ipcq 31.3899999999976 42.8899999999976
3 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 128 raw 12.019999999996799 29.0199999999968
4 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 256 ipcq 33.1399999999976 48.1399999999976
5 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 256 raw 13.019999999996799 31.0199999999968
6 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 384 ipcq 34.8899999999976 50.3899999999976
7 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 384 raw 14.019999999996799 32.0199999999968
8 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 512 ipcq 36.6399999999976 52.6399999999976
9 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 512 raw 15.019999999996799 33.0199999999968
10 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 768 ipcq 40.1399999999976 57.1399999999976
11 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 768 raw 17.0199999999968 35.0199999999968
12 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 1024 ipcq 43.6399999999976 62.6399999999976
13 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 1024 raw 19.0199999999968 37.0199999999968
14 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 2048 ipcq 57.6399999999976 84.6399999999976
15 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 2048 raw 27.0199999999968 45.0199999999968
16 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 4096 ipcq 85.6399999999976 128.6399999999976
17 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 4096 raw 43.0199999999968 61.0199999999968
18 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 8192 ipcq 141.64000000000306 216.64000000000306
19 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 8192 raw 75.02000000000407 93.02000000000407
20 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 10240 ipcq 169.64000000000306 260.64000000000306
21 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 10240 raw 91.02000000000407 109.02000000000407
22 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 128 ipcq 31.3899999999976 42.8899999999976
23 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 128 raw 12.019999999996799 29.0199999999968
24 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 256 ipcq 33.1399999999976 48.1399999999976
25 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 256 raw 13.019999999996799 31.0199999999968
26 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 384 ipcq 34.8899999999976 50.3899999999976
27 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 384 raw 14.019999999996799 32.0199999999968
28 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 512 ipcq 36.6399999999976 52.6399999999976
29 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 512 raw 15.019999999996799 33.0199999999968
30 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 768 ipcq 40.1399999999976 57.1399999999976
31 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 768 raw 17.0199999999968 35.0199999999968
32 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 1024 ipcq 43.6399999999976 62.6399999999976
33 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 1024 raw 19.0199999999968 37.0199999999968
34 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 2048 ipcq 57.6399999999976 84.6399999999976
35 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 2048 raw 27.0199999999968 45.0199999999968
36 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 4096 ipcq 85.6399999999976 128.6399999999976
37 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 4096 raw 43.0199999999968 61.0199999999968
38 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 8192 ipcq 141.64000000000306 216.64000000000306
39 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 8192 raw 75.02000000000407 93.02000000000407
40 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 10240 ipcq 169.64000000000306 260.64000000000306
41 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 10240 raw 91.02000000000407 109.02000000000407
42 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 128 ipcq 67.40999999999804 81.15999999999804
43 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 128 raw 68.53999999999724 89.28999999999724
44 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 256 ipcq 69.15999999999804 88.65999999999804
45 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 256 raw 70.03999999999724 95.53999999999724
46 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 384 ipcq 70.90999999999804 90.90999999999804
47 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 384 raw 71.53999999999724 96.53999999999724
48 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 512 ipcq 72.65999999999804 93.15999999999804
49 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 512 raw 73.03999999999724 97.53999999999724
50 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 768 ipcq 76.15999999999804 97.65999999999804
51 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 768 raw 76.03999999999724 99.53999999999724
52 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 1024 ipcq 79.65999999999804 103.15999999999804
53 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 1024 raw 79.03999999999724 102.53999999999724
54 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 2048 ipcq 93.65999999999804 125.15999999999804
55 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 2048 raw 91.03999999999724 114.53999999999724
56 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 4096 ipcq 121.65999999999804 169.15999999999804
57 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 4096 raw 115.03999999999724 138.53999999999724
58 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 8192 ipcq 177.65999999999985 257.15999999999985
59 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 8192 raw 163.04000000000087 186.54000000000087
60 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 10240 ipcq 205.65999999999985 301.15999999999985
61 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 10240 raw 187.04000000000087 210.54000000000087
62 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 128 ipcq 87.40999999999804 103.15999999999804
63 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 128 raw 88.53999999999724 111.28999999999724
64 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 256 ipcq 89.15999999999804 112.65999999999804
65 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 256 raw 90.03999999999724 119.53999999999724
66 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 384 ipcq 90.90999999999804 114.90999999999804
67 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 384 raw 91.53999999999724 120.53999999999724
68 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 512 ipcq 92.65999999999804 117.15999999999804
69 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 512 raw 93.03999999999724 121.53999999999724
70 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 768 ipcq 96.15999999999804 121.65999999999804
71 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 768 raw 96.03999999999724 123.53999999999724
72 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 1024 ipcq 99.65999999999804 127.15999999999804
73 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 1024 raw 99.03999999999724 126.53999999999724
74 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 2048 ipcq 113.65999999999804 149.15999999999804
75 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 2048 raw 111.03999999999724 138.53999999999724
76 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 4096 ipcq 141.65999999999804 193.15999999999804
77 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 4096 raw 135.03999999999724 162.53999999999724
78 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 8192 ipcq 197.65999999999985 281.15999999999985
79 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 8192 raw 183.04000000000087 210.54000000000087
80 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 10240 ipcq 225.65999999999985 325.15999999999985
81 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 10240 raw 207.04000000000087 234.54000000000087
Binary file not shown.

After

Width:  |  Height:  |  Size: 368 KiB

+99 -31
View File
@@ -1,33 +1,101 @@
<svg xmlns="http://www.w3.org/2000/svg" width="500" height="360" viewBox="0 0 500 360"> <svg xmlns="http://www.w3.org/2000/svg" width="560" height="420" viewBox="0 0 560 420">
<title>pe</title> <title>pe</title>
<rect width="500" height="360" fill="#f8fafc"/> <rect width="560" height="420" fill="#f8fafc"/>
<text x="250" y="18" text-anchor="middle" font-family="monospace" font-size="14" font-weight="bold" fill="#1e293b">PE VIEW</text> <text x="280" y="18" text-anchor="middle" font-family="monospace" font-size="14" font-weight="bold" fill="#1e293b">PE VIEW</text>
<line x1="92.5" y1="180.0" x2="180.0" y2="180.0" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
<text x="136.2" y="176.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">0.5mm</text> <!-- ── Boxes ── -->
<polyline points="180.0,180.0 180.0,92.5 285.0,92.5" fill="none" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
<text x="232.5" y="132.2" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">0.5mm</text> <!-- PE CPU -->
<line x1="180.0" y1="180.0" x2="285.0" y2="180.0" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/> <rect x="48.8" y="185.5" width="87.5" height="49.0" rx="4" fill="#ef4444" stroke="#475569" stroke-width="1"/>
<text x="232.5" y="176.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">0.5mm</text> <text x="92.5" y="214.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE CPU</text>
<polyline points="180.0,180.0 180.0,267.5 285.0,267.5" fill="none" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
<text x="232.5" y="219.8" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">0.5mm</text> <!-- PE SCHEDULER -->
<polyline points="285.0,92.5 390.0,92.5 390.0,180.0" fill="none" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/> <rect x="156.2" y="185.5" width="87.5" height="49.0" rx="4" fill="#f59e0b" stroke="#475569" stroke-width="1"/>
<text x="337.5" y="132.2" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">0.5mm 512GB/s</text> <text x="200.0" y="214.0" text-anchor="middle" font-family="monospace" font-size="9" fill="#1e293b">PE SCHEDULER</text>
<line x1="285.0" y1="180.0" x2="390.0" y2="180.0" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
<text x="337.5" y="176.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">0.5mm 512GB/s</text> <!-- PE_IPCQ (control plane) — new -->
<polyline points="285.0,267.5 390.0,267.5 390.0,180.0" fill="none" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/> <rect x="48.8" y="68.0" width="105" height="49.0" rx="4" fill="#0ea5e9" stroke="#0277bd" stroke-width="1.5" stroke-dasharray="5,3"/>
<text x="337.5" y="219.8" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">0.5mm 512GB/s</text> <text x="101.3" y="89.0" text-anchor="middle" font-family="monospace" font-size="9" fill="#ffffff">PE IPCQ</text>
<rect x="48.8" y="155.5" width="87.5" height="49.0" rx="4" fill="#ef4444" stroke="#475569" stroke-width="1"/> <text x="101.3" y="102.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#e0f2fe">(control plane)</text>
<text x="92.5" y="184.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE CPU</text>
<rect x="136.2" y="155.5" width="87.5" height="49.0" rx="4" fill="#f59e0b" stroke="#475569" stroke-width="1"/> <!-- PE MMU -->
<text x="180.0" y="184.0" text-anchor="middle" font-family="monospace" font-size="9" fill="#1e293b">PE SCHEDULER</text> <rect x="173.8" y="68.0" width="87.5" height="49.0" rx="4" fill="#e2e8f0" stroke="#475569" stroke-width="1"/>
<rect x="241.2" y="68.0" width="87.5" height="49.0" rx="4" fill="#3b82f6" stroke="#475569" stroke-width="1"/> <text x="217.5" y="96.5" text-anchor="middle" font-family="monospace" font-size="10" fill="#1e293b">PE MMU</text>
<text x="285.0" y="96.5" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE DMA</text>
<rect x="241.2" y="155.5" width="87.5" height="49.0" rx="4" fill="#8b5cf6" stroke="#475569" stroke-width="1"/> <!-- PE DMA -->
<text x="285.0" y="184.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE GEMM</text> <rect x="281.2" y="68.0" width="87.5" height="49.0" rx="4" fill="#3b82f6" stroke="#475569" stroke-width="1"/>
<rect x="241.2" y="243.0" width="87.5" height="49.0" rx="4" fill="#ec4899" stroke="#475569" stroke-width="1"/> <text x="325.0" y="96.5" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE DMA</text>
<text x="285.0" y="271.5" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE MATH</text>
<rect x="136.2" y="68.0" width="87.5" height="49.0" rx="4" fill="#e2e8f0" stroke="#475569" stroke-width="1"/> <!-- PE GEMM -->
<text x="180.0" y="96.5" text-anchor="middle" font-family="monospace" font-size="10" fill="#1e293b">PE MMU</text> <rect x="281.2" y="185.5" width="87.5" height="49.0" rx="4" fill="#8b5cf6" stroke="#475569" stroke-width="1"/>
<rect x="346.2" y="155.5" width="87.5" height="49.0" rx="4" fill="#10b981" stroke="#475569" stroke-width="1"/> <text x="325.0" y="214.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE GEMM</text>
<text x="390.0" y="184.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE TCM</text>
<!-- PE MATH -->
<rect x="281.2" y="283.0" width="87.5" height="49.0" rx="4" fill="#ec4899" stroke="#475569" stroke-width="1"/>
<text x="325.0" y="311.5" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE MATH</text>
<!-- PE TCM (with IPCQ Slot Region) -->
<rect x="396.2" y="155.5" width="120" height="100" rx="4" fill="#10b981" stroke="#475569" stroke-width="1"/>
<text x="456.2" y="180.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE TCM</text>
<!-- IPCQ Slot Region inside TCM -->
<rect x="406.2" y="193.0" width="100" height="28" rx="3" fill="#065f46" stroke="#ffffff" stroke-width="1" stroke-dasharray="4,2" opacity="0.7"/>
<text x="456.2" y="211.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#d1fae5">IPCQ Slot Region</text>
<!-- ── Connections (edges) ── -->
<!-- PE CPU → PE SCHEDULER -->
<line x1="136.3" y1="210.0" x2="156.2" y2="210.0" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
<text x="146.2" y="205.0" text-anchor="middle" font-family="monospace" font-size="6" fill="#64748b">cmd</text>
<!-- PE CPU → PE_IPCQ (IpcqRequest) -->
<line x1="92.5" y1="185.5" x2="92.5" y2="117.0" stroke="#0277bd" stroke-width="1.5"/>
<polygon points="92.5,117.0 89.5,123.0 95.5,123.0" fill="#0277bd"/>
<text x="77" y="152.0" text-anchor="middle" font-family="monospace" font-size="6" fill="#0277bd" transform="rotate(-90,77,152)">IpcqRequest</text>
<!-- PE SCHEDULER → PE DMA (TileToken, compute port) -->
<polyline points="200.0,185.5 200.0,92.5 281.2,92.5" fill="none" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
<text x="240.0" y="86.5" text-anchor="middle" font-family="monospace" font-size="6" fill="#64748b">TileToken</text>
<!-- PE SCHEDULER → PE GEMM -->
<line x1="243.7" y1="210.0" x2="281.2" y2="210.0" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
<!-- PE SCHEDULER → PE MATH -->
<polyline points="200.0,234.5 200.0,307.5 281.2,307.5" fill="none" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
<!-- PE DMA ↔ PE TCM -->
<line x1="368.7" y1="92.5" x2="456.2" y2="155.5" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
<text x="420.0" y="118.0" text-anchor="middle" font-family="monospace" font-size="6" fill="#64748b">DMA R/W</text>
<!-- PE GEMM → PE TCM -->
<line x1="368.7" y1="210.0" x2="396.2" y2="210.0" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
<text x="382.4" y="205.0" text-anchor="middle" font-family="monospace" font-size="6" fill="#64748b">512GB/s</text>
<!-- PE MATH → PE TCM -->
<polyline points="368.7,307.5 456.2,307.5 456.2,255.5" fill="none" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
<text x="412.4" y="301.0" text-anchor="middle" font-family="monospace" font-size="6" fill="#64748b">512GB/s</text>
<!-- PE_IPCQ → PE DMA (IpcqDmaToken, IPCQ port) — blue -->
<line x1="153.8" y1="82.0" x2="281.2" y2="82.0" stroke="#1565c0" stroke-width="1.5"/>
<polygon points="281.2,82.0 275.2,79.0 275.2,85.0" fill="#1565c0"/>
<text x="217.5" y="77.0" text-anchor="middle" font-family="monospace" font-size="6" fill="#1565c0">IpcqDmaToken</text>
<!-- PE DMA → PE_IPCQ (IpcqMetaArrival) — blue -->
<line x1="281.2" y1="102.0" x2="153.8" y2="102.0" stroke="#1565c0" stroke-width="1.5"/>
<polygon points="153.8,102.0 159.8,99.0 159.8,105.0" fill="#1565c0"/>
<text x="217.5" y="113.0" text-anchor="middle" font-family="monospace" font-size="6" fill="#1565c0">IpcqMetaArrival</text>
<!-- PE_IPCQ → PE DMA (IpcqCreditMetadata, dashed purple) -->
<line x1="153.8" y1="92.5" x2="281.2" y2="92.5" stroke="#7b1fa2" stroke-width="1" stroke-dasharray="4,3"/>
<text x="217.5" y="62.0" text-anchor="middle" font-family="monospace" font-size="6" fill="#7b1fa2">IpcqCreditMeta (dashed)</text>
<!-- ── Legend ── -->
<rect x="15" y="365" width="530" height="45" rx="4" fill="#f1f5f9" stroke="#cbd5e1" stroke-width="0.5"/>
<line x1="25" y1="385" x2="55" y2="385" stroke="#1565c0" stroke-width="1.5"/>
<text x="60" y="388" font-family="monospace" font-size="7" fill="#1e293b">IPCQ data path</text>
<line x1="140" y1="385" x2="170" y2="385" stroke="#7b1fa2" stroke-width="1" stroke-dasharray="4,3"/>
<text x="175" y="388" font-family="monospace" font-size="7" fill="#1e293b">IPCQ credit return</text>
<line x1="290" y1="385" x2="320" y2="385" stroke="#94a3b8" stroke-width="1.5"/>
<text x="325" y="388" font-family="monospace" font-size="7" fill="#1e293b">Compute data path</text>
<rect x="430" y="378" width="40" height="14" rx="2" fill="none" stroke="#0277bd" stroke-width="1" stroke-dasharray="4,2"/>
<text x="475" y="388" font-family="monospace" font-size="7" fill="#1e293b">IPCQ (new)</text>
</svg> </svg>

Before

Width:  |  Height:  |  Size: 3.4 KiB

After

Width:  |  Height:  |  Size: 6.6 KiB

+237
View File
@@ -0,0 +1,237 @@
# Hardware Architecture Overview
본 문서는 AI Accelerator 플랫폼의 하드웨어 아키텍처를 요약한다.
논문 분석 및 설계 검토 시 배경 지식으로 사용할 수 있다.
> Source ADRs: ADR-0003, ADR-0004, ADR-0014, ADR-0017, ADR-0022
---
## 1. System Hierarchy
시스템은 4단계 계층으로 구성된다.
```
Tray
├── Host CPU (runtime, data placement)
├── SIP 0 (accelerator)
│ ├── IO Chiplet (PCIe-EP, IO_CPU)
│ ├── CUBE 0
│ │ ├── PE 0 ─ PE 7
│ │ ├── HBM + HBM_CTRL
│ │ ├── Shared SRAM
│ │ ├── M_CPU (management)
│ │ ├── NOC 2D Mesh (router grid)
│ │ └── UCIe × 4 (N/S/E/W)
│ ├── CUBE 1 ... CUBE N
│ └── IO Chiplet(s)
├── SIP 1 ... SIP M
└── Interconnect (PCIe / UAL)
```
| Level | 구성 | 연결 |
|-------|------|------|
| **Tray** | Host CPU + 여러 SIP | PCIe / UAL fabric |
| **SIP** | 여러 CUBE + IO chiplet(s) | UCIe (cube간), PCIe-EP (host) |
| **CUBE** | 여러 PE + HBM + SRAM + M_CPU + NOC mesh | UCIe × 4 ports (N/S/E/W) |
| **PE** | PE_CPU + DMA + GEMM + MATH + TCM | NOC router 직결 |
---
## 2. CUBE Architecture
각 CUBE는 독립적인 compute + memory unit이다.
### 2.1 Components
- **PEs**: 복수의 Processing Element, 각각 독립 커널 실행 가능
- **HBM + HBM_CTRL**: High Bandwidth Memory. 각 PE에 local HBM 영역이 할당되어 최소 latency로 접근
- **Shared SRAM**: Cube 내 모든 PE가 NOC를 통해 접근 가능한 공유 메모리
- **M_CPU**: Management CPU. 커널 command 분배 및 completion 집계
- **NOC (On-die Fabric)**: Cube 내 모든 컴포넌트를 연결하는 interconnect
- **UCIe × 4**: 각 방향(N/S/E/W)에 복수 connection, inter-cube 연결
### 2.2 NOC (On-die Fabric)
NOC는 cube 내 PE, HBM, SRAM, M_CPU, UCIe를 연결하는 on-die interconnect이다.
**아키텍처 요구사항** (topology 무관):
- 모든 PE가 local HBM에 full bandwidth로 접근 가능
- 모든 PE가 shared SRAM에 접근 가능
- 모든 PE가 UCIe를 통해 다른 cube에 접근 가능
- M_CPU가 모든 PE에 command를 전달 가능
- Per-link contention 모델링 지원
**현재 시뮬레이터 구현** (변경 가능):
- 2D mesh router grid (6×6 기본, XY deterministic routing)
- HBM_CTRL가 각 PE의 local router에 직결 (0 mesh hop)
- 중앙 HBM zone에는 router 배치 제외
- Contention: directed segment당 capacity=1 resource
NOC topology는 2D mesh 외에 ring, crossbar, hierarchical 등 다른 구현도 가능하며,
아키텍처 요구사항을 만족하는 한 교체 가능하다.
### 2.3 주요 Data Path
| Path | Route | 특성 |
|------|-------|------|
| PE → Local HBM | PE_DMA → NOC → HBM_CTRL | 최소 hop, 256 GB/s (×0.8 eff) |
| PE → Remote PE's HBM | PE_DMA → NOC hops → HBM_CTRL | NOC BW/hop에 제한 |
| PE → Shared SRAM | PE_DMA → NOC → SRAM | SRAM link BW에 제한 |
| PE → Other CUBE's HBM | PE_DMA → NOC → UCIe → NOC → HBM_CTRL | UCIe overhead 16ns (TX+RX) |
| Kernel Launch | IO → UCIe → M_CPU → NOC → PE_CPU | Command path |
### 2.4 Key Bandwidths
| Connection | Bandwidth | Notes |
|------------|-----------|-------|
| PE_DMA ↔ NOC | 256 GB/s | HBM slice BW 매칭 |
| NOC ↔ HBM_CTRL | 256 GB/s | Per PE, local 접근 |
| NOC ↔ SRAM | 128 GB/s × 4 | 512 GB/s aggregate |
| NOC ↔ UCIe conn | 128 GB/s × 4 | 512 GB/s per port |
| UCIe link (inter-cube) | 512 GB/s | 1.0mm seam distance |
---
## 3. PE Architecture
각 PE는 하나의 커널 인스턴스를 실행하는 독립적인 프로세서이다.
### 3.1 Internal Components
```
PE_CPU (control)
├──→ PE_SCHED (dispatch)
│ │
│ ├──→ PE_DMA ←→ NOC Router ←→ HBM / SRAM / UCIe
│ │ ↕
│ ├──→ PE_FETCH_STORE ←→ PE_TCM (16MB SRAM)
│ │
│ ├──→ PE_GEMM (matrix multiply)
│ └──→ PE_MATH (elementwise)
└──→ PE_IPCQ (collective communication)
└──→ PE_DMA (IPCQ port)
```
| Component | 역할 |
|-----------|------|
| **PE_CPU** | 커널 instruction stream 실행, command 생성 |
| **PE_SCHED** | Command dispatcher. Composite command를 tile pipeline으로 분해 |
| **PE_DMA** | HBM ↔ TCM 데이터 전송 (NOC router mesh 경유). Read/Write 각 1 channel |
| **PE_GEMM** | 행렬 곱 엔진. TCM에서 activation 읽기, HBM에서 weight streaming 가능 |
| **PE_MATH** | Element-wise 연산 엔진. TCM 읽기/쓰기 |
| **PE_TCM** | 16MB on-PE SRAM. Compute의 staging memory |
| **PE_IPCQ** | PE간 collective communication 제어 (ring buffer pointer 관리) |
### 3.2 Compute Pipeline (Tiled Execution)
Composite command는 tile 단위로 pipeline 실행된다:
```
DMA_READ(t) → COMPUTE(t) → DMA_WRITE(t)
```
**Overlap 규칙**:
- 허용: `DMA_READ(t+1) ∥ COMPUTE(t)`, `DMA_WRITE(t-1) ∥ COMPUTE(t)`
- 금지: `GEMM(t) ∥ GEMM(t')`, `GEMM(t) ∥ MATH(t')`
**DMA Engine**: Read/Write 각각 capacity=1. 동시 Read+Write 가능, 동시 Read+Read 불가.
**Compute Engine**: GEMM과 MATH가 단일 compute slot 공유. 한 번에 하나만 실행.
### 3.3 TCM-centric Dataflow
모든 compute는 TCM을 중심으로 동작한다:
```
Input: HBM → (NOC) → PE_DMA → PE_TCM
Compute: PE_TCM → GEMM / MATH → PE_TCM
Output: PE_TCM → PE_DMA → (NOC) → HBM
```
PE_TCM은 두 영역으로 분할된다:
- **SchedulerReservedTCM**: PE_SCHED 전용 tile buffer 영역 (DMA/compute staging)
- **AllocatableTCM**: 범용 할당 영역 (host/DP-visible)
두 영역은 hard isolation으로 분리된다.
---
## 4. Memory Hierarchy
### 4.1 Memory Tiers
| Memory | Scope | Capacity | Bandwidth | Latency | 접근 경로 |
|--------|-------|----------|-----------|---------|-----------|
| **PE_TCM** | PE 전용 | 16 MB | 512 GB/s | 최저 | 직결 (NOC 미경유) |
| **Shared SRAM** | Cube 공유 | 32 MB | 128 GB/s (NoC link) | 중간 | PE → NOC → SRAM |
| **Local HBM** | PE별 할당 | Large | 256 GB/s (×0.8 eff) | 높음 | PE → local router → HBM_CTRL |
| **Remote HBM** | 다른 PE/Cube | Large | Mesh/UCIe BW 제한 | 최고 | PE → NOC mesh → (UCIe) → HBM_CTRL |
### 4.2 Local HBM Bandwidth Guarantee
- 각 PE는 자신의 local router에 직결된 HBM pseudo-channel을 가진다
- Local HBM 접근은 **0 mesh hop** (switching overhead만)
- Effective bandwidth = spec BW × efficiency factor (default 0.8)
- 예: 256 GB/s × 0.8 = 204.8 GB/s effective
- 이 보장은 fabric bandwidth와 무관하게 유지된다
### 4.3 Memory-Centric Design Principle
- **Compute는 data 근처에서 실행**: PE가 local HBM에 직결되어 데이터 이동 최소화
- **TCM은 compute의 scratchpad**: 모든 compute 입출력은 TCM을 경유
- **HBM은 primary storage**: 대용량 tensor 저장, DMA로 TCM에 tile 단위 load/store
- **Shared SRAM은 cube-level 공유**: 중간 결과 공유, reduction buffer 등
---
## 5. SPMD Execution Model
### 5.1 Program ID Mapping
커널은 2D hardware grid에서 SPMD 방식으로 실행된다:
| API | 반환 값 | 설명 |
|-----|---------|------|
| `tl.program_id(axis=0)` | `local_pe_id` | Cube 내 PE 인덱스 |
| `tl.program_id(axis=1)` | `cube_id` | Cube 인덱스 |
| `tl.num_programs(axis=0)` | `num_pes_per_cube` | Cube당 PE 수 |
| `tl.num_programs(axis=1)` | `num_cubes` | 전체 Cube 수 |
```python
global_pid = tl.program_id(axis=1) * tl.num_programs(axis=0) + tl.program_id(axis=0)
```
### 5.2 Axis Mapping Rationale
- **axis=0 = PE (innermost)**: Cube 내 PE는 HBM을 공유하고 local NOC로 통신. 빠르고 tightly-coupled. GPU의 thread-in-block에 대응.
- **axis=1 = Cube (outer)**: Cube 간 통신은 UCIe 경유로 latency 높음. Coarse scheduling 단위. GPU의 block-in-grid에 대응.
### 5.3 Kernel Execution Flow
```
Host CPU
→ IO_CPU (PCIe-EP)
→ M_CPU (management, per cube)
→ PE_CPU × N (broadcast)
→ Each PE executes same kernel with unique (pe_id, cube_id)
```
모든 PE가 동일 커널을 실행하되, `program_id`로 자신의 데이터 파티션을 식별하여
독립적으로 처리한다 (SPMD).
---
## 6. Inter-PE Communication (IPCQ)
PE 간 collective communication은 IPCQ(Inter-PE Communication Queue)를 통해 수행된다.
- 각 PE는 방향별(N/S/E/W 등) ring buffer 기반 queue pair를 유지
- **DMA-IPCQ co-design**: DMA data flit에 head pointer를 piggyback하여 별도 제어 메시지 없이 pointer 동기화
- **Credit-based flow control**: Receiver가 slot 소비 후 16B credit으로 sender에게 알림
- IPCQ slot buffer는 **TCM, Shared SRAM, Local HBM** 중 선택 가능
자세한 내용은 `docs/ipcq-dma-codesign-hw.md` 및 ADR-0023 참조.