CCL allreduce: rename to lrab_hierarchical_allreduce + descriptive plots

Rename the intercube all-reduce identity to lrab_hierarchical_allreduce
(module, config key, distributed test) so the name reflects both levels
it implements: LRAB intra-SIP (local reduce to center root + broadcast)
and the hierarchical inter-SIP topology exchange (ring/torus/mesh).
ADR-0032 slug kept as the stable decision id; pure rename, no logic change.

Also in this batch:
- ADR-0032 (EN+KO): document the shipped center-root bidirectional reduce
  (doc was stale corner-root); annotate ccl.yaml root_cube as a placeholder.
- Rename allreduce + pe2pe latency plots to descriptive, title-matching
  filenames and retitle the in-plot headings; drop overview/overview_log.
- Point the PPTX image refs at the new plot names.

Doc + derived-artifact + rename only; no simulation behavior changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 20:50:48 -07:00
parent e77e4a1703
commit ff7d727ddd
38 changed files with 259 additions and 272 deletions
+8 -3
@@ -6,7 +6,7 @@
  defaults:
  # Algorithm to run for this benchmark execution.
- algorithm: intercube_allreduce
+ algorithm: lrab_hierarchical_allreduce
  # IPCQ ring buffer location.
  # tcm — PE-local TCM (fast, small, conflicts with compute TCM access)
@@ -37,9 +37,14 @@ algorithms:
  # exchange on root cube, then broadcast back. SIP topology is read
  # from topology.yaml → system.sips.topology. Kernel auto-selects
  # ring / torus / mesh inter-SIP exchange pattern.
- intercube_allreduce:
- module: kernbench.ccl.algorithms.intercube_allreduce
+ lrab_hierarchical_allreduce:
+ module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
  topology: none
  buffer_kind: tcm
  n_elem: 8
+ # root_cube: the kernel currently elects the root dynamically as the
+ # geometric center of the cube mesh (root = (h//2)*w + (w//2)) to
+ # minimize the intra-SIP critical path, so this value is NOT read today.
+ # Kept as a placeholder for a future explicit-root override / runtime
+ # election hook (see ADR-0032 D1 + Non-goals).
  root_cube: 15
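Reviewer note on the `root_cube` placeholder: the following is a minimal Python sketch, assuming row-major cube numbering, of the two root formulas quoted in this commit. The function names `center_root` and `corner_root` are hypothetical and are not taken from the kernel. It illustrates why the configured value 15 (the SE corner of a 4×4 mesh) differs from the root the kernel now elects.

```python
def center_root(cube_w: int, cube_h: int) -> int:
    """Row-major index of the geometric-center cube: (h // 2) * w + (w // 2)."""
    return (cube_h // 2) * cube_w + (cube_w // 2)


def corner_root(cube_w: int, cube_h: int) -> int:
    """Row-major index of the SE-corner cube: (h - 1) * w + (w - 1)."""
    return (cube_h - 1) * cube_w + (cube_w - 1)


if __name__ == "__main__":
    # On a 4x4 mesh the center root is cube 10, while the placeholder
    # root_cube: 15 in ccl.yaml corresponds to the old SE-corner root.
    print(center_root(4, 4))  # 10
    print(corner_root(4, 4))  # 15
```

Because the value is not read today, leaving `root_cube: 15` in place does not affect which cube acts as root; only the formula in the kernel does.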
@@ -31,7 +31,7 @@ pe0-only same-lane inter-cube reduce**, then, on the root cube, inter-SIP
  ### Current state
- - `src/kernbench/ccl/algorithms/intercube_allreduce.py` — kernel
+ - `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py` — kernel
  - `src/kernbench/ccl/sfr_config.py` — `configure_sfr_intercube_multisip`
  - `src/kernbench/runtime_api/distributed.py` — `AhbmCCLBackend` wires this
  automatically at `init_process_group` time.
@@ -42,29 +42,46 @@ pe0-only same-lane inter-cube reduce**, then, on the root cube, inter-SIP
  ## Decision
- ### D1. Algorithm structure — 5 phases
+ ### D1. Algorithm structure — 5 phases (center-root, bidirectional)
+ The root cube sits at the geometric **center** of the cube mesh:
+ ```
+ root_col = cube_w // 2
+ root_row = cube_h // 2
+ root_cube = root_row * cube_w + root_col # center; 10 on a 4×4 mesh
+ ```
+ Each reduce/broadcast phase converges/diverges **bidirectionally** toward
+ this center, halving the intra-SIP critical path versus a corner-root walk
+ (4×4 mesh: 4 hops reduce + 4 hops broadcast vs 6+6 with an SE-corner root).
  For each SIP (launched concurrently by `mp.spawn`):
  ```
- Phase 1 — Row reduce W → E (cube mesh, pe0 only):
- col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds row sum.
+ Phase 1 — Row reduce converging at col == root_col (cube mesh, pe0 only):
+ left half (col < root_col) walks W→E; right half (col > root_col)
+ walks E→W; the root_col cube merges both sides → holds row sum.
- Phase 2 — Col reduce N → S on rightmost column (pe0, col = mesh_w-1):
- row=0 sends S → row=1 accumulates, sends S → ... → root cube (15)
- holds the full SIP sum.
+ Phase 2 — Col reduce on col == root_col converging at row == root_row:
+ above (row < root_row) walks N→S; below (row > root_row) walks S→N;
+ the root cube merges both → holds the full SIP sum.
- Phase 3 — Inter-SIP exchange on root cube (pe0 of root cube only):
+ Phase 3 — Inter-SIP exchange on cube_id == root_cube (pe0 only):
  Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast —
  selected by sip_topo_kind (from topology.yaml sips.topology).
- Phase 4 — Col broadcast S → N on rightmost column.
- Phase 5 — Row broadcast E → W across the cube mesh.
+ Phase 4 — Col broadcast on col == root_col, outward from root_row.
+ Phase 5 — Row broadcast outward from root_col across the cube mesh.
  ```
  After all phases every cube's pe0 holds the global sum.
+ **Single-cube fast-path**: when `cube_w == cube_h == 1` (one cube per rank,
+ the common TP case), the intra-SIP reduce/broadcast phases are skipped and
+ the kernel goes straight to the Phase 3 inter-SIP exchange.
  The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}`
  (ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
  across topologies; only phase 3 branches. Helper functions `_inter_sip_ring`, `_inter_sip_torus_2d`,
@@ -152,17 +169,19 @@ system:
  ```yaml
  defaults:
- algorithm: intercube_allreduce
+ algorithm: lrab_hierarchical_allreduce
  buffer_kind: tcm
  ...
  algorithms:
- intercube_allreduce:
- module: kernbench.ccl.algorithms.intercube_allreduce
+ lrab_hierarchical_allreduce:
+ module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
  topology: none
  buffer_kind: tcm
  n_elem: 8
- root_cube: 15
+ root_cube: 15 # Not used today — the kernel elects the root dynamically
+ # as the geometric center (see D1). Kept as a placeholder for a future
+ # explicit-root override / runtime election hook.
  ```
  `topology.yaml`:
@@ -205,10 +224,11 @@ sip:
  - **Asymmetric SIP topologies** (non-square mesh/torus).
  `torus_2d` and `mesh_2d_no_wrap` require `n_sips = k²`.
  - **Pipelined chunks**: single-tile per cube, no pipelining yet.
- - **Root cube runtime election**: the kernel currently uses
- `root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)`, hardcoded to the SE corner. SFR
- wiring covers all cubes, so runtime election is a pure
- kernel change when needed.
+ - **Root cube runtime election**: to minimize the intra-SIP critical path,
+ the kernel currently uses the geometric center,
+ `root_cube = (mesh_h // 2) * mesh_w + (mesh_w // 2)`. SFR
+ wiring covers all cubes, so electing a different root at runtime
+ is a pure kernel change when needed.
  ---
@@ -241,15 +261,15 @@ sip:
  | File | Change |
  |---|---|
- | `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
+ | `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
  | `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
  | `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
  | `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
  | `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
- | `ccl.yaml` | Single `intercube_allreduce` entry |
+ | `ccl.yaml` | Single `lrab_hierarchical_allreduce` entry |
  | `topology.yaml` | Added `system.sips.topology` |
  | `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
  | `tests/test_allreduce_multidevice.py` (new) | Config-driven ring/torus/mesh |
- | `tests/test_distributed_intercube_allreduce.py` (new) | Full `dist.all_reduce` path |
+ | `tests/test_distributed_lrab_hierarchical_allreduce.py` (new) | Full `dist.all_reduce` path |
  | `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
  | Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |
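Reviewer note on topology selection: the ADR text in this diff names three `sip_topo_kind` values and states that the two 2-D topologies require `n_sips = k²`. The sketch below shows one way that lookup and check could look. The integer assignments (0, 1, 2 in the listed order), the mapping contents, and the helper `resolve_sip_topology` are assumptions for illustration; the real `TOPO_NAME_TO_KIND` in the kernel module is not shown in this commit.

```python
import math
from typing import Tuple

# Assumed contents: the ADR lists the kinds as {0, 1, 2} in this order.
TOPO_NAME_TO_KIND = {"ring_1d": 0, "torus_2d": 1, "mesh_2d_no_wrap": 2}


def resolve_sip_topology(name: str, n_sips: int) -> Tuple[int, int]:
    """Return (sip_topo_kind, k); k is the 2-D side length, or 0 for a ring."""
    if name not in TOPO_NAME_TO_KIND:
        raise ValueError(f"unknown SIP topology: {name!r}")
    kind = TOPO_NAME_TO_KIND[name]
    if name == "ring_1d":
        return kind, 0  # a ring has no square-count requirement
    k = math.isqrt(n_sips)
    if k * k != n_sips:
        # torus_2d and mesh_2d_no_wrap need a perfect-square SIP count
        raise ValueError(f"{name} requires n_sips = k^2, got {n_sips}")
    return kind, k


if __name__ == "__main__":
    print(resolve_sip_topology("torus_2d", 4))
    print(resolve_sip_topology("ring_1d", 6))
```

Rejecting non-square counts up front keeps the asymmetric case, which the ADR lists as a non-goal, from reaching the kernel.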
+40 -20
@@ -32,7 +32,7 @@ bandwidth characteristics for the common per-cube DP workload.
  ### Current state
- - `src/kernbench/ccl/algorithms/intercube_allreduce.py` — kernel
+ - `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py` — kernel
  - `src/kernbench/ccl/sfr_config.py` — `configure_sfr_intercube_multisip`
  - `src/kernbench/runtime_api/distributed.py` — `AhbmCCLBackend` wires this
  automatically at `init_process_group` time.
@@ -43,29 +43,46 @@ bandwidth characteristics for the common per-cube DP workload.
  ## Decision
- ### D1. Algorithm structure — 5 phases
+ ### D1. Algorithm structure — 5 phases (center-root, bidirectional)
+ The root cube sits at the geometric **center** of the cube mesh:
+ ```
+ root_col = cube_w // 2
+ root_row = cube_h // 2
+ root_cube = root_row * cube_w + root_col # center; 10 on a 4×4 mesh
+ ```
+ Each reduce/broadcast phase converges/diverges **bidirectionally** toward
+ this center, halving the intra-SIP critical path versus a corner-root walk
+ (4×4 mesh: 4 hops reduce + 4 hops broadcast vs 6+6 with an SE-corner root).
  For each SIP (launched concurrently by `mp.spawn`):
  ```
- Phase 1 — Row reduce W → E (cube mesh, pe0 only):
- col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds row sum.
+ Phase 1 — Row reduce converging at col == root_col (cube mesh, pe0 only):
+ left half (col < root_col) walks W→E; right half (col > root_col)
+ walks E→W; the root_col cube merges both sides → holds row sum.
- Phase 2 — Col reduce N → S on rightmost column (pe0, col = mesh_w-1):
- row=0 sends S → row=1 accumulates, sends S → ... → root cube (15)
- holds the full SIP sum.
+ Phase 2 — Col reduce on col == root_col converging at row == root_row:
+ above (row < root_row) walks N→S; below (row > root_row) walks S→N;
+ the root cube merges both → holds the full SIP sum.
- Phase 3 — Inter-SIP exchange on root cube (pe0 of root cube only):
+ Phase 3 — Inter-SIP exchange on cube_id == root_cube (pe0 only):
  Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast —
  selected by sip_topo_kind (from topology.yaml sips.topology).
- Phase 4 — Col broadcast S → N on rightmost column.
- Phase 5 — Row broadcast E → W across the cube mesh.
+ Phase 4 — Col broadcast on col == root_col, outward from root_row.
+ Phase 5 — Row broadcast outward from root_col across the cube mesh.
  ```
  After all phases every cube's pe0 holds the global sum.
+ **Single-cube fast-path**: when `cube_w == cube_h == 1` (one cube per rank,
+ the common TP case), the intra-SIP reduce/broadcast phases are skipped and
+ the kernel goes straight to the Phase 3 inter-SIP exchange.
  The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}`
  (ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
  across topologies; only phase 3 branches. Helper functions
@@ -154,17 +171,19 @@ At each `dist.all_reduce(tensor)` call:
  ```yaml
  defaults:
- algorithm: intercube_allreduce
+ algorithm: lrab_hierarchical_allreduce
  buffer_kind: tcm
  ...
  algorithms:
- intercube_allreduce:
- module: kernbench.ccl.algorithms.intercube_allreduce
+ lrab_hierarchical_allreduce:
+ module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
  topology: none
  buffer_kind: tcm
  n_elem: 8
- root_cube: 15
+ root_cube: 15 # NOT read today — the kernel elects the root dynamically
+ # as the geometric center (see D1). Kept as a placeholder
+ # for a future explicit-root override / runtime election.
  ```
  `topology.yaml`:
@@ -207,9 +226,10 @@ Modules loaded via `cfg["module"]` must export:
  `mesh_2d_no_wrap` require `n_sips = k²`.
  - **Pipelined chunks**: single-tile per cube, no pipelining yet.
  - **Root cube runtime election**: the kernel currently uses
- `root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)` hardcoded to the SE
- corner. SFR wiring covers all cubes, so runtime election is a pure kernel
- change when needed.
+ `root_cube = (mesh_h // 2) * mesh_w + (mesh_w // 2)` — the geometric
+ center, chosen to minimize the intra-SIP critical path. SFR wiring
+ covers all cubes, so electing a different root at runtime is a pure
+ kernel change when needed.
  ---
@@ -242,15 +262,15 @@ Modules loaded via `cfg["module"]` must export:
  | File | Change |
  |---|---|
- | `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
+ | `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
  | `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
  | `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
  | `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
  | `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
- | `ccl.yaml` | Single `intercube_allreduce` entry |
+ | `ccl.yaml` | Single `lrab_hierarchical_allreduce` entry |
  | `topology.yaml` | Added `system.sips.topology` |
  | `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
  | `tests/test_allreduce_multidevice.py` (new) | Config-driven ring/torus/mesh |
- | `tests/test_distributed_intercube_allreduce.py` (new) | Full `dist.all_reduce` path |
+ | `tests/test_distributed_lrab_hierarchical_allreduce.py` (new) | Full `dist.all_reduce` path |
  | `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
  | Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |
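Reviewer note on D1: the sketch below is a plain-Python model of the intra-SIP part of the algorithm (Phases 1, 2, 4 and 5) for a single SIP, written to sanity-check the hop counts quoted in the ADR. It is not the kernel. `walk_reduce` and `intra_sip_allreduce` are hypothetical names, the grid holds one integer per cube (standing in for pe0's tile), the two halves of each walk are assumed to run concurrently, and the Phase 3 inter-SIP exchange is left out.

```python
from typing import List, Optional, Tuple


def walk_reduce(line: List[int], root: int) -> Tuple[int, int]:
    """Reduce a 1-D line of values into index `root` from both ends.

    Returns (sum held at root, critical-path hops). The longer half sets
    the hop count because both halves are assumed to proceed in parallel.
    """
    left = 0
    for value in line[:root]:                # walks toward root from index 0
        left += value
    right = 0
    for value in reversed(line[root + 1:]):  # walks toward root from the far end
        right += value
    hops = max(root, len(line) - 1 - root)
    return line[root] + left + right, hops


def intra_sip_allreduce(
    grid: List[List[int]],
    root_row: Optional[int] = None,
    root_col: Optional[int] = None,
) -> Tuple[List[List[int]], int]:
    """Return (per-cube result grid, reduce hops); broadcast costs the same hops."""
    cube_h, cube_w = len(grid), len(grid[0])
    if cube_w == 1 and cube_h == 1:
        return [[grid[0][0]]], 0             # single-cube fast path: nothing to do
    root_row = cube_h // 2 if root_row is None else root_row
    root_col = cube_w // 2 if root_col is None else root_col

    # Phase 1: every row converges on col == root_col.
    row_sums, row_hops = [], 0
    for row in grid:
        row_sum, row_hops = walk_reduce(row, root_col)
        row_sums.append(row_sum)

    # Phase 2: the root column converges on row == root_row.
    total, col_hops = walk_reduce(row_sums, root_row)

    # Phases 4-5 mirror the reduce: every cube ends up holding `total`.
    result = [[total] * cube_w for _ in range(cube_h)]
    return result, row_hops + col_hops


if __name__ == "__main__":
    mesh = [[r * 4 + c for c in range(4)] for r in range(4)]
    print(intra_sip_allreduce(mesh)[1])                          # center root
    print(intra_sip_allreduce(mesh, root_row=3, root_col=3)[1])  # SE-corner root
```

On a 4×4 mesh this model gives 4 reduce hops for the center root and 6 for the SE-corner root, matching the 4+4 versus 6+6 figures in D1. Under the same assumptions the saving grows toward one half on larger meshes, since the center root costs roughly `w/2 + h/2` hops against roughly `w + h` for a corner.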
21 binary image files not shown: 9 appear only as After (38, 36, 75, 37, 86, 53, 52, 51 and 52 KiB), 11 appear only as Before (75, 39, 79, 80, 75, 37, 36, 52, 46, 51 and 43 KiB), and 1 appears as both (135 KiB before, 137 KiB after).

+80 -80
@@ -1,81 +1,81 @@
  hop,label,size_bytes,path,total_ns
- h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,ipcq,24.88749999999891
+ latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),128,ipcq,24.88749999999891
- h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,raw,33.57999999999811
+ latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),128,raw,33.57999999999811
- h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,ipcq,28.13749999999891
+ latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),256,ipcq,28.13749999999891
- h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,raw,36.07999999999811
+ latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),256,raw,36.07999999999811
- h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,ipcq,29.88749999999891
+ latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),384,ipcq,29.88749999999891
- h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,raw,37.07999999999811
+ latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),384,raw,37.07999999999811
- h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,ipcq,31.63749999999891
+ latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),512,ipcq,31.63749999999891
- h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,raw,38.07999999999811
+ latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),512,raw,38.07999999999811
- h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,ipcq,35.13749999999891
+ latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),768,ipcq,35.13749999999891
- h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,raw,40.07999999999811
+ latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),768,raw,40.07999999999811
- h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,ipcq,38.63749999999891
+ latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),1024,ipcq,38.63749999999891
- h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,raw,42.07999999999811
+ latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),1024,raw,42.07999999999811
- h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,ipcq,52.63749999999891
+ latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),2048,ipcq,52.63749999999891
- h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,raw,50.07999999999811
+ latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),2048,raw,50.07999999999811
- h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,ipcq,80.63750000000073
+ latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),4096,ipcq,80.63750000000073
- h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,raw,66.08000000000175
+ latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),4096,raw,66.08000000000175
- h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,ipcq,136.63750000000073
+ latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),8192,ipcq,136.63750000000073
- h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,raw,98.08000000000175
+ latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),8192,raw,98.08000000000175
- h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,ipcq,164.63750000000073
+ latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),10240,ipcq,164.63750000000073
- h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,raw,114.08000000000175
+ latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),10240,raw,114.08000000000175
- h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,ipcq,38.49749999999585
+ latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),128,ipcq,38.49749999999585
- h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,raw,47.18999999999505
+ latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),128,raw,47.18999999999505
- h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,ipcq,43.24749999999585
+ latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),256,ipcq,43.24749999999585
- h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,raw,51.18999999999505
+ latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),256,raw,51.18999999999505
- h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,ipcq,44.99749999999585
+ latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),384,ipcq,44.99749999999585
- h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,raw,52.18999999999505
+ latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),384,raw,52.18999999999505
- h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,ipcq,46.74749999999585
+ latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),512,ipcq,46.74749999999585
- h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,raw,53.18999999999505
+ latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),512,raw,53.18999999999505
- h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,ipcq,50.24749999999585
+ latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),768,ipcq,50.24749999999585
- h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,raw,55.18999999999505
+ latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),768,raw,55.18999999999505
- h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,ipcq,53.74749999999585
+ latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),1024,ipcq,53.74749999999585
- h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,raw,57.18999999999505
+ latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),1024,raw,57.18999999999505
- h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,ipcq,67.74749999999585
+ latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),2048,ipcq,67.74749999999585
- h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,raw,65.18999999999505
+ latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),2048,raw,65.18999999999505
- h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,ipcq,95.74750000000131
+ latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),4096,ipcq,95.74750000000131
- h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,raw,81.19000000000233
+ latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),4096,raw,81.19000000000233
- h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,ipcq,151.7475000000013
+ latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),8192,ipcq,151.7475000000013
- h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,raw,113.19000000000233
+ latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),8192,raw,113.19000000000233
- h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,ipcq,179.7475000000013
+ latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),10240,ipcq,179.7475000000013
- h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,raw,129.19000000000233
+ latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),10240,raw,129.19000000000233
- h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,ipcq,81.15999999999804
+ latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),128,ipcq,81.15999999999804
- h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,raw,89.28999999999724
+ latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),128,raw,89.28999999999724
- h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,ipcq,88.65999999999804
+ latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),256,ipcq,88.65999999999804
- h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,raw,95.53999999999724
+ latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),256,raw,95.53999999999724
- h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,ipcq,90.90999999999804
+ latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),384,ipcq,90.90999999999804
- h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,raw,96.53999999999724
+ latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),384,raw,96.53999999999724
- h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,ipcq,93.15999999999804
+ latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),512,ipcq,93.15999999999804
- h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,raw,97.53999999999724
+ latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),512,raw,97.53999999999724
- h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,ipcq,97.65999999999804
+ latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),768,ipcq,97.65999999999804
- h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,raw,99.53999999999724
+ latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),768,raw,99.53999999999724
- h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,ipcq,103.15999999999804
+ latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),1024,ipcq,103.15999999999804
- h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,raw,102.53999999999724
+ latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),1024,raw,102.53999999999724
- h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,ipcq,125.15999999999804
+ latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),2048,ipcq,125.15999999999804
- h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,raw,114.53999999999724
+ latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),2048,raw,114.53999999999724
- h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,ipcq,169.15999999999985
+ latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),4096,ipcq,169.15999999999985
- h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,raw,138.54000000000087
+ latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),4096,raw,138.54000000000087
- h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,ipcq,257.15999999999985
+ latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),8192,ipcq,257.15999999999985
- h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,raw,186.54000000000087
+ latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),8192,raw,186.54000000000087
- h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,ipcq,301.15999999999985
+ latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),10240,ipcq,301.15999999999985
- h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,raw,210.54000000000087
+ latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),10240,raw,210.54000000000087
- h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,ipcq,103.15999999999804
+ latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),128,ipcq,103.15999999999804
- h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,raw,111.28999999999724
+ latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),128,raw,111.28999999999724
- h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,ipcq,112.65999999999804
+ latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),256,ipcq,112.65999999999804
- h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,raw,119.53999999999724
+ latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),256,raw,119.53999999999724
- h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,ipcq,114.90999999999804
+ latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),384,ipcq,114.90999999999804
- h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,raw,120.53999999999724
+ latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),384,raw,120.53999999999724
- h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,ipcq,117.15999999999804
+ latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),512,ipcq,117.15999999999804
- h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,raw,121.53999999999724
+ latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),512,raw,121.53999999999724
- h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,ipcq,121.65999999999804
+ latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),768,ipcq,121.65999999999804
- h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,raw,123.53999999999724
+ latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),768,raw,123.53999999999724
- h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,ipcq,127.15999999999804
+ latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),1024,ipcq,127.15999999999804
- h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,raw,126.53999999999724
+ latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),1024,raw,126.53999999999724
- h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,ipcq,149.15999999999804
+ latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),2048,ipcq,149.15999999999804
- h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,raw,138.53999999999724
+ latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),2048,raw,138.53999999999724
- h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,ipcq,193.15999999999985
+ latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),4096,ipcq,193.15999999999985
- h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,raw,162.54000000000087
+ latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),4096,raw,162.54000000000087
- h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,ipcq,281.15999999999985
+ latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),8192,ipcq,281.15999999999985
- h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,raw,210.54000000000087
+ latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),8192,raw,210.54000000000087
- h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,ipcq,325.15999999999985
+ latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),10240,ipcq,325.15999999999985
- h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,raw,234.54000000000087
+ latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),10240,raw,234.54000000000087
1 hop label size_bytes path total_ns
2 h1_intra_horizontal latency_intracube_PE0_to_PE1_horizontal Intra-cube horizontal (pe0 to pe1) Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal) 128 ipcq 24.88749999999891
3 h1_intra_horizontal latency_intracube_PE0_to_PE1_horizontal Intra-cube horizontal (pe0 to pe1) Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal) 128 raw 33.57999999999811
4 h1_intra_horizontal latency_intracube_PE0_to_PE1_horizontal Intra-cube horizontal (pe0 to pe1) Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal) 256 ipcq 28.13749999999891
5 h1_intra_horizontal latency_intracube_PE0_to_PE1_horizontal Intra-cube horizontal (pe0 to pe1) Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal) 256 raw 36.07999999999811
6 h1_intra_horizontal latency_intracube_PE0_to_PE1_horizontal Intra-cube horizontal (pe0 to pe1) Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal) 384 ipcq 29.88749999999891
7 h1_intra_horizontal latency_intracube_PE0_to_PE1_horizontal Intra-cube horizontal (pe0 to pe1) Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal) 384 raw 37.07999999999811
8 h1_intra_horizontal latency_intracube_PE0_to_PE1_horizontal Intra-cube horizontal (pe0 to pe1) Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal) 512 ipcq 31.63749999999891
9 h1_intra_horizontal latency_intracube_PE0_to_PE1_horizontal Intra-cube horizontal (pe0 to pe1) Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal) 512 raw 38.07999999999811
10 h1_intra_horizontal latency_intracube_PE0_to_PE1_horizontal Intra-cube horizontal (pe0 to pe1) Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal) 768 ipcq 35.13749999999891
11 h1_intra_horizontal latency_intracube_PE0_to_PE1_horizontal Intra-cube horizontal (pe0 to pe1) Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal) 768 raw 40.07999999999811
12 h1_intra_horizontal latency_intracube_PE0_to_PE1_horizontal Intra-cube horizontal (pe0 to pe1) Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal) 1024 ipcq 38.63749999999891
13 h1_intra_horizontal latency_intracube_PE0_to_PE1_horizontal Intra-cube horizontal (pe0 to pe1) Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal) 1024 raw 42.07999999999811
14 h1_intra_horizontal latency_intracube_PE0_to_PE1_horizontal Intra-cube horizontal (pe0 to pe1) Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal) 2048 ipcq 52.63749999999891
15 h1_intra_horizontal latency_intracube_PE0_to_PE1_horizontal Intra-cube horizontal (pe0 to pe1) Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal) 2048 raw 50.07999999999811
16 h1_intra_horizontal latency_intracube_PE0_to_PE1_horizontal Intra-cube horizontal (pe0 to pe1) Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal) 4096 ipcq 80.63750000000073
17 h1_intra_horizontal latency_intracube_PE0_to_PE1_horizontal Intra-cube horizontal (pe0 to pe1) Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal) 4096 raw 66.08000000000175
18 h1_intra_horizontal latency_intracube_PE0_to_PE1_horizontal Intra-cube horizontal (pe0 to pe1) Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal) 8192 ipcq 136.63750000000073
19 h1_intra_horizontal latency_intracube_PE0_to_PE1_horizontal Intra-cube horizontal (pe0 to pe1) Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal) 8192 raw 98.08000000000175
20 h1_intra_horizontal latency_intracube_PE0_to_PE1_horizontal Intra-cube horizontal (pe0 to pe1) Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal) 10240 ipcq 164.63750000000073
21 h1_intra_horizontal latency_intracube_PE0_to_PE1_horizontal Intra-cube horizontal (pe0 to pe1) Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal) 10240 raw 114.08000000000175
22 h2_intra_vertical latency_intracube_PE0_to_PE4_vertical Intra-cube vertical (pe0 to pe4) Intra-cube PE-to-PE latency: PE0 → PE4 (vertical) 128 ipcq 38.49749999999585
23 h2_intra_vertical latency_intracube_PE0_to_PE4_vertical Intra-cube vertical (pe0 to pe4) Intra-cube PE-to-PE latency: PE0 → PE4 (vertical) 128 raw 47.18999999999505
24 h2_intra_vertical latency_intracube_PE0_to_PE4_vertical Intra-cube vertical (pe0 to pe4) Intra-cube PE-to-PE latency: PE0 → PE4 (vertical) 256 ipcq 43.24749999999585
25 h2_intra_vertical latency_intracube_PE0_to_PE4_vertical Intra-cube vertical (pe0 to pe4) Intra-cube PE-to-PE latency: PE0 → PE4 (vertical) 256 raw 51.18999999999505
26 h2_intra_vertical latency_intracube_PE0_to_PE4_vertical Intra-cube vertical (pe0 to pe4) Intra-cube PE-to-PE latency: PE0 → PE4 (vertical) 384 ipcq 44.99749999999585
27 h2_intra_vertical latency_intracube_PE0_to_PE4_vertical Intra-cube vertical (pe0 to pe4) Intra-cube PE-to-PE latency: PE0 → PE4 (vertical) 384 raw 52.18999999999505
28 h2_intra_vertical latency_intracube_PE0_to_PE4_vertical Intra-cube vertical (pe0 to pe4) Intra-cube PE-to-PE latency: PE0 → PE4 (vertical) 512 ipcq 46.74749999999585
29 h2_intra_vertical latency_intracube_PE0_to_PE4_vertical Intra-cube vertical (pe0 to pe4) Intra-cube PE-to-PE latency: PE0 → PE4 (vertical) 512 raw 53.18999999999505
30 h2_intra_vertical latency_intracube_PE0_to_PE4_vertical Intra-cube vertical (pe0 to pe4) Intra-cube PE-to-PE latency: PE0 → PE4 (vertical) 768 ipcq 50.24749999999585
31 h2_intra_vertical latency_intracube_PE0_to_PE4_vertical Intra-cube vertical (pe0 to pe4) Intra-cube PE-to-PE latency: PE0 → PE4 (vertical) 768 raw 55.18999999999505
32 h2_intra_vertical latency_intracube_PE0_to_PE4_vertical Intra-cube vertical (pe0 to pe4) Intra-cube PE-to-PE latency: PE0 → PE4 (vertical) 1024 ipcq 53.74749999999585
33 h2_intra_vertical latency_intracube_PE0_to_PE4_vertical Intra-cube vertical (pe0 to pe4) Intra-cube PE-to-PE latency: PE0 → PE4 (vertical) 1024 raw 57.18999999999505
34 h2_intra_vertical latency_intracube_PE0_to_PE4_vertical Intra-cube vertical (pe0 to pe4) Intra-cube PE-to-PE latency: PE0 → PE4 (vertical) 2048 ipcq 67.74749999999585
35 h2_intra_vertical latency_intracube_PE0_to_PE4_vertical Intra-cube vertical (pe0 to pe4) Intra-cube PE-to-PE latency: PE0 → PE4 (vertical) 2048 raw 65.18999999999505
36 h2_intra_vertical latency_intracube_PE0_to_PE4_vertical Intra-cube vertical (pe0 to pe4) Intra-cube PE-to-PE latency: PE0 → PE4 (vertical) 4096 ipcq 95.74750000000131
37 h2_intra_vertical latency_intracube_PE0_to_PE4_vertical Intra-cube vertical (pe0 to pe4) Intra-cube PE-to-PE latency: PE0 → PE4 (vertical) 4096 raw 81.19000000000233
38 h2_intra_vertical latency_intracube_PE0_to_PE4_vertical Intra-cube vertical (pe0 to pe4) Intra-cube PE-to-PE latency: PE0 → PE4 (vertical) 8192 ipcq 151.7475000000013
39 h2_intra_vertical latency_intracube_PE0_to_PE4_vertical Intra-cube vertical (pe0 to pe4) Intra-cube PE-to-PE latency: PE0 → PE4 (vertical) 8192 raw 113.19000000000233
40 h2_intra_vertical latency_intracube_PE0_to_PE4_vertical Intra-cube vertical (pe0 to pe4) Intra-cube PE-to-PE latency: PE0 → PE4 (vertical) 10240 ipcq 179.7475000000013
41 h2_intra_vertical latency_intracube_PE0_to_PE4_vertical Intra-cube vertical (pe0 to pe4) Intra-cube PE-to-PE latency: PE0 → PE4 (vertical) 10240 raw 129.19000000000233
42 h3_inter_cube_horizontal latency_intercube_C0PE0_to_C1PE0_horizontal Inter-cube horizontal (cube0 to cube1) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal) 128 ipcq 81.15999999999804
43 h3_inter_cube_horizontal latency_intercube_C0PE0_to_C1PE0_horizontal Inter-cube horizontal (cube0 to cube1) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal) 128 raw 89.28999999999724
44 h3_inter_cube_horizontal latency_intercube_C0PE0_to_C1PE0_horizontal Inter-cube horizontal (cube0 to cube1) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal) 256 ipcq 88.65999999999804
45 h3_inter_cube_horizontal latency_intercube_C0PE0_to_C1PE0_horizontal Inter-cube horizontal (cube0 to cube1) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal) 256 raw 95.53999999999724
46 h3_inter_cube_horizontal latency_intercube_C0PE0_to_C1PE0_horizontal Inter-cube horizontal (cube0 to cube1) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal) 384 ipcq 90.90999999999804
47 h3_inter_cube_horizontal latency_intercube_C0PE0_to_C1PE0_horizontal Inter-cube horizontal (cube0 to cube1) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal) 384 raw 96.53999999999724
48 h3_inter_cube_horizontal latency_intercube_C0PE0_to_C1PE0_horizontal Inter-cube horizontal (cube0 to cube1) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal) 512 ipcq 93.15999999999804
49 h3_inter_cube_horizontal latency_intercube_C0PE0_to_C1PE0_horizontal Inter-cube horizontal (cube0 to cube1) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal) 512 raw 97.53999999999724
50 h3_inter_cube_horizontal latency_intercube_C0PE0_to_C1PE0_horizontal Inter-cube horizontal (cube0 to cube1) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal) 768 ipcq 97.65999999999804
51 h3_inter_cube_horizontal latency_intercube_C0PE0_to_C1PE0_horizontal Inter-cube horizontal (cube0 to cube1) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal) 768 raw 99.53999999999724
52 h3_inter_cube_horizontal latency_intercube_C0PE0_to_C1PE0_horizontal Inter-cube horizontal (cube0 to cube1) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal) 1024 ipcq 103.15999999999804
53 h3_inter_cube_horizontal latency_intercube_C0PE0_to_C1PE0_horizontal Inter-cube horizontal (cube0 to cube1) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal) 1024 raw 102.53999999999724
54 h3_inter_cube_horizontal latency_intercube_C0PE0_to_C1PE0_horizontal Inter-cube horizontal (cube0 to cube1) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal) 2048 ipcq 125.15999999999804
55 h3_inter_cube_horizontal latency_intercube_C0PE0_to_C1PE0_horizontal Inter-cube horizontal (cube0 to cube1) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal) 2048 raw 114.53999999999724
56 h3_inter_cube_horizontal latency_intercube_C0PE0_to_C1PE0_horizontal Inter-cube horizontal (cube0 to cube1) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal) 4096 ipcq 169.15999999999985
57 h3_inter_cube_horizontal latency_intercube_C0PE0_to_C1PE0_horizontal Inter-cube horizontal (cube0 to cube1) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal) 4096 raw 138.54000000000087
58 h3_inter_cube_horizontal latency_intercube_C0PE0_to_C1PE0_horizontal Inter-cube horizontal (cube0 to cube1) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal) 8192 ipcq 257.15999999999985
59 h3_inter_cube_horizontal latency_intercube_C0PE0_to_C1PE0_horizontal Inter-cube horizontal (cube0 to cube1) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal) 8192 raw 186.54000000000087
60 h3_inter_cube_horizontal latency_intercube_C0PE0_to_C1PE0_horizontal Inter-cube horizontal (cube0 to cube1) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal) 10240 ipcq 301.15999999999985
61 h3_inter_cube_horizontal latency_intercube_C0PE0_to_C1PE0_horizontal Inter-cube horizontal (cube0 to cube1) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal) 10240 raw 210.54000000000087
62 h4_inter_cube_vertical latency_intercube_C0PE0_to_C4PE0_vertical Inter-cube vertical (cube0 to cube4) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical) 128 ipcq 103.15999999999804
63 h4_inter_cube_vertical latency_intercube_C0PE0_to_C4PE0_vertical Inter-cube vertical (cube0 to cube4) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical) 128 raw 111.28999999999724
64 h4_inter_cube_vertical latency_intercube_C0PE0_to_C4PE0_vertical Inter-cube vertical (cube0 to cube4) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical) 256 ipcq 112.65999999999804
65 h4_inter_cube_vertical latency_intercube_C0PE0_to_C4PE0_vertical Inter-cube vertical (cube0 to cube4) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical) 256 raw 119.53999999999724
66 h4_inter_cube_vertical latency_intercube_C0PE0_to_C4PE0_vertical Inter-cube vertical (cube0 to cube4) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical) 384 ipcq 114.90999999999804
67 h4_inter_cube_vertical latency_intercube_C0PE0_to_C4PE0_vertical Inter-cube vertical (cube0 to cube4) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical) 384 raw 120.53999999999724
68 h4_inter_cube_vertical latency_intercube_C0PE0_to_C4PE0_vertical Inter-cube vertical (cube0 to cube4) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical) 512 ipcq 117.15999999999804
69 h4_inter_cube_vertical latency_intercube_C0PE0_to_C4PE0_vertical Inter-cube vertical (cube0 to cube4) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical) 512 raw 121.53999999999724
70 h4_inter_cube_vertical latency_intercube_C0PE0_to_C4PE0_vertical Inter-cube vertical (cube0 to cube4) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical) 768 ipcq 121.65999999999804
71 h4_inter_cube_vertical latency_intercube_C0PE0_to_C4PE0_vertical Inter-cube vertical (cube0 to cube4) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical) 768 raw 123.53999999999724
72 h4_inter_cube_vertical latency_intercube_C0PE0_to_C4PE0_vertical Inter-cube vertical (cube0 to cube4) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical) 1024 ipcq 127.15999999999804
73 h4_inter_cube_vertical latency_intercube_C0PE0_to_C4PE0_vertical Inter-cube vertical (cube0 to cube4) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical) 1024 raw 126.53999999999724
74 h4_inter_cube_vertical latency_intercube_C0PE0_to_C4PE0_vertical Inter-cube vertical (cube0 to cube4) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical) 2048 ipcq 149.15999999999804
75 h4_inter_cube_vertical latency_intercube_C0PE0_to_C4PE0_vertical Inter-cube vertical (cube0 to cube4) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical) 2048 raw 138.53999999999724
76 h4_inter_cube_vertical latency_intercube_C0PE0_to_C4PE0_vertical Inter-cube vertical (cube0 to cube4) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical) 4096 ipcq 193.15999999999985
77 h4_inter_cube_vertical latency_intercube_C0PE0_to_C4PE0_vertical Inter-cube vertical (cube0 to cube4) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical) 4096 raw 162.54000000000087
78 h4_inter_cube_vertical latency_intercube_C0PE0_to_C4PE0_vertical Inter-cube vertical (cube0 to cube4) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical) 8192 ipcq 281.15999999999985
79 h4_inter_cube_vertical latency_intercube_C0PE0_to_C4PE0_vertical Inter-cube vertical (cube0 to cube4) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical) 8192 raw 210.54000000000087
80 h4_inter_cube_vertical latency_intercube_C0PE0_to_C4PE0_vertical Inter-cube vertical (cube0 to cube4) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical) 10240 ipcq 325.15999999999985
81 h4_inter_cube_vertical latency_intercube_C0PE0_to_C4PE0_vertical Inter-cube vertical (cube0 to cube4) Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical) 10240 raw 234.54000000000087
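Review note: the renamed CSV keeps the same columns (hop, label, size_bytes, path, total_ns), so downstream readers only need the new `hop` keys. The sketch below is a minimal illustration, not part of the commit: it finds the smallest payload at which the `ipcq` path becomes slower than `raw` for one hop. The function name `first_size_where_ipcq_is_slower` and the inline `SAMPLE` string are hypothetical; the sample rows are taken from the table above with `total_ns` rounded to two decimals.

```python
import csv
import io

# Rows copied from the renamed CSV above (total_ns rounded to 2 decimals).
SAMPLE = """hop,label,size_bytes,path,total_ns
latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),768,ipcq,121.66
latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),768,raw,123.54
latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),1024,ipcq,127.16
latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),1024,raw,126.54
latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),2048,ipcq,149.16
latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),2048,raw,138.54
"""


def first_size_where_ipcq_is_slower(csv_text, hop):
    """Return the smallest size_bytes where ipcq latency exceeds raw, or None."""
    by_size = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["hop"] != hop:
            continue
        # Group the two paths (ipcq / raw) under their shared payload size.
        paths = by_size.setdefault(int(row["size_bytes"]), {})
        paths[row["path"]] = float(row["total_ns"])
    for size in sorted(by_size):
        paths = by_size[size]
        if "ipcq" in paths and "raw" in paths and paths["ipcq"] > paths["raw"]:
            return size
    return None


crossover = first_size_where_ipcq_is_slower(
    SAMPLE, "latency_intercube_C0PE0_to_C4PE0_vertical"
)
```

Filtering on the old key (`h4_inter_cube_vertical`) against the renamed file would return `None`, which is a quick way to spot a consumer that was not updated.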
+4 -4
@@ -4,8 +4,8 @@ Slides:
1. Overall architecture — how PEs are connected (cube_mesh_view) 1. Overall architecture — how PEs are connected (cube_mesh_view)
2. Model correctness — DMA vs P2P latency (pe2pe overview) 2. Model correctness — DMA vs P2P latency (pe2pe overview)
3. PE-to-PE IPCQ communication (ipcq_two_pe_dma) 3. PE-to-PE IPCQ communication (ipcq_two_pe_dma)
4. 6-device allreduce — model vs theoretical vs ext-sim (overview_broken) 4. 6-device allreduce — model vs theoretical vs FSIM (comparison_…_fsim)
5. IPCQ buffer-kind sweep — TCM vs SRAM vs HBM (buffer_kind_sweep) 5. IPCQ buffer-kind sweep — TCM vs SRAM vs HBM (…_with_TCM_SRAM_HBM)
6. PE_accelerator data path (composite GEMM pipeline structure) 6. PE_accelerator data path (composite GEMM pipeline structure)
7. matmul(32, 128, 32) — composite GEMM execution sequence 7. matmul(32, 128, 32) — composite GEMM execution sequence
8. matmul(32, 128, 128) — pipeline scaling and HBM contention 8. matmul(32, 128, 128) — pipeline scaling and HBM contention
@@ -63,7 +63,7 @@ SLIDES = [
}, },
{ {
"title": "4. 6-Device Allreduce: Model vs Theoretical vs External Simulator", "title": "4. 6-Device Allreduce: Model vs Theoretical vs External Simulator",
"image": DIAG / "allreduce_latency_plots" / "overview_broken.png", "image": DIAG / "allreduce_latency_plots" / "comparison_mesh_vs_ring_vs_2DTorus_vs_theoretical_vs_fsim.png",
"bullets": [ "bullets": [
"Three SIP topologies (ring / torus / mesh) swept 16 B → 96 KB per PE", "Three SIP topologies (ring / torus / mesh) swept 16 B → 96 KB per PE",
"Dashed red curve: hand-derived theoretical model for torus_2d (6 SIPs)", "Dashed red curve: hand-derived theoretical model for torus_2d (6 SIPs)",
@@ -73,7 +73,7 @@ SLIDES = [
}, },
{ {
"title": "5. IPCQ Slot Memory: TCM vs SRAM vs HBM", "title": "5. IPCQ Slot Memory: TCM vs SRAM vs HBM",
"image": DIAG / "allreduce_latency_plots" / "buffer_kind_sweep.png", "image": DIAG / "allreduce_latency_plots" / "AllReduce_LRAB_2Dtorus_6SiP_2x3_with_TCM_SRAM_HBM.png",
"bullets": [ "bullets": [
"Same allreduce with slot memory swapped: TCM (per-PE local) / SRAM / HBM (cube-shared, behind router link)", "Same allreduce with slot memory swapped: TCM (per-PE local) / SRAM / HBM (cube-shared, behind router link)",
"Cost = NoC drain + slot-IO + PE↔bank hop; only TCM skips the bank hop", "Cost = NoC drain + slot-IO + PE↔bank hop; only TCM skips the bank hop",
+18 -34
@@ -1,6 +1,7 @@
"""One-shot: render overview.png with an external 366 µs reference, in two """One-shot: render the broken-y-axis allreduce comparison with the FSIM
variants — log scale and broken y-axis. Reads docs/diagrams/allreduce_latency_plots/summary.csv single-device reference. Reads docs/diagrams/allreduce_latency_plots/summary.csv
and writes overview_log.png and overview_broken.png alongside it. and writes comparison_mesh_vs_ring_vs_2DTorus_vs_theoretical_vs_fsim.png
alongside it.
This is a derived-artifact generator (per CLAUDE.md): plotting only, no production This is a derived-artifact generator (per CLAUDE.md): plotting only, no production
or test logic touched. or test logic touched.
@@ -17,7 +18,7 @@ ROOT = Path(__file__).resolve().parent.parent
PLOT_DIR = ROOT / "docs" / "diagrams" / "allreduce_latency_plots" PLOT_DIR = ROOT / "docs" / "diagrams" / "allreduce_latency_plots"
CSV_PATH = PLOT_DIR / "summary.csv" CSV_PATH = PLOT_DIR / "summary.csv"
EXT_LABEL = "ext-sim single-device reduce: 366 µs" EXT_LABEL = "FSIM (single device): 366 µs"
EXT_LATENCY_NS = 366_000.0 EXT_LATENCY_NS = 366_000.0
COLORS = { COLORS = {
@@ -26,6 +27,15 @@ COLORS = {
"mesh_2d_no_wrap": "tab:green", "mesh_2d_no_wrap": "tab:green",
} }
# Display labels (data keys above stay as the summary.csv sip_topology
# values; these are only the human-readable legend strings). All non-FSIM
# runs use 6 devices; the grid differs per topology.
DISPLAY = {
"ring_1d": "Ring 1x6 (6 devices)",
"torus_2d": "2D Torus 2x3 (6 devices)",
"mesh_2d_no_wrap": "2D Mesh 2x3 (6 devices)",
}
# Hand-derived theoretical model for torus_2d (6 SIPs). Mirrors # Hand-derived theoretical model for torus_2d (6 SIPs). Mirrors
# _aggregate_sweep_plots in tests/test_allreduce_multidevice.py. # _aggregate_sweep_plots in tests/test_allreduce_multidevice.py.
NOC_PACKET_BYTES = 128 NOC_PACKET_BYTES = 128
@@ -51,7 +61,7 @@ def _plot_theoretical(ax, records):
[r["bytes_per_pe"] for r in torus_rs], [r["bytes_per_pe"] for r in torus_rs],
[_theoretical_torus_2d_ns(r["bytes_per_pe"]) for r in torus_rs], [_theoretical_torus_2d_ns(r["bytes_per_pe"]) for r in torus_rs],
color="tab:red", linestyle="--", linewidth=1.6, marker="x", color="tab:red", linestyle="--", linewidth=1.6, marker="x",
label="theoretical torus_2d (6 SIPs)", label="Theoretical 2D Torus 2x3",
) )
@@ -91,36 +101,11 @@ def _plot_curves(ax, records, topologies):
[r["bytes_per_pe"] for r in rs], [r["bytes_per_pe"] for r in rs],
[r["latency_ns"] for r in rs], [r["latency_ns"] for r in rs],
marker="o", marker="o",
label=f"{topo}", label=DISPLAY.get(topo, topo),
color=COLORS.get(topo), color=COLORS.get(topo),
) )
def emit_log(records):
topologies = sorted({r["sip_topology"] for r in records})
fig, ax = plt.subplots(figsize=(9, 6))
_plot_curves(ax, records, topologies)
_plot_theoretical(ax, records)
ax.scatter(
[_ext_x(records)], [EXT_LATENCY_NS],
marker="*", s=220, color="tab:red", zorder=5,
label=EXT_LABEL,
)
ax.set_xscale("log", base=2)
ax.set_yscale("log")
ax.set_xlabel("Bytes per PE (log scale)")
ax.set_ylabel("Time (ns) — log scale")
ax.set_title("Multi-device allreduce latency vs external single-device reference")
ax.grid(True, which="both", alpha=0.3)
ax.xaxis.set_major_formatter(mticker.FuncFormatter(_bytes_fmt))
ax.legend(loc="upper left")
fig.tight_layout()
out = PLOT_DIR / "overview_log.png"
fig.savefig(out, dpi=120)
plt.close(fig)
print(f"wrote {out}")
def emit_broken(records): def emit_broken(records):
topologies = sorted({r["sip_topology"] for r in records}) topologies = sorted({r["sip_topology"] for r in records})
max_local = max(r["latency_ns"] for r in records) max_local = max(r["latency_ns"] for r in records)
@@ -172,9 +157,9 @@ def emit_broken(records):
ax_bot.legend(handles_bot + handles_top, labels_bot + labels_top, ax_bot.legend(handles_bot + handles_top, labels_bot + labels_top,
loc="upper left") loc="upper left")
fig.suptitle("Multi-device allreduce latency vs external single-device reference (broken y-axis)") fig.suptitle("Multidevice allreduce (ring, Mesh, 2DTorus) vs FSIM latency")
fig.tight_layout() fig.tight_layout()
out = PLOT_DIR / "overview_broken.png" out = PLOT_DIR / "comparison_mesh_vs_ring_vs_2DTorus_vs_theoretical_vs_fsim.png"
fig.savefig(out, dpi=120) fig.savefig(out, dpi=120)
plt.close(fig) plt.close(fig)
print(f"wrote {out}") print(f"wrote {out}")
@@ -184,7 +169,6 @@ def main():
records = _load_records() records = _load_records()
if not records: if not records:
raise SystemExit(f"no rows in {CSV_PATH}") raise SystemExit(f"no rows in {CSV_PATH}")
emit_log(records)
emit_broken(records) emit_broken(records)
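Review note: the script's `_theoretical_torus_2d_ns` mirrors the hand-derived model whose constants are spelled out in the comment block this commit removes from tests/test_allreduce_multidevice.py. As a standalone sanity check, here is a minimal sketch using those same constants. The public function name `theoretical_torus_2d_ns` is illustrative, and the script's actual helper may differ in detail.

```python
# Constants as stated in the removed comment block of the test file.
NOC_PACKET_BYTES = 128          # one NoC packet
PES_PER_CUBE = 8                # packet count is per cube, 8 PEs cooperate
T_STARTUP_NS = 1346.0           # 456 (intra-SIP) + 445 (X) + 445 (Y)
TAU_NS = (8741.0 - 1346.0) / (6144 - 1)  # bottleneck interval per packet


def theoretical_torus_2d_ns(bytes_per_pe):
    """Pipelined latency estimate: startup plus (N - 1) bottleneck intervals."""
    bytes_per_cube = int(bytes_per_pe) * PES_PER_CUBE
    # Ceiling division without importing math; at least one packet is sent.
    n_packets = max(1, -(-bytes_per_cube // NOC_PACKET_BYTES))
    return T_STARTUP_NS + (n_packets - 1) * TAU_NS


smallest = theoretical_torus_2d_ns(16)          # 16 B/PE -> exactly 1 packet
largest = theoretical_torus_2d_ns(96 * 1024)    # 96 KB/PE -> 6144 packets
```

The two endpoints reproduce the calibration points of the model: one packet costs only the startup latency, and 6144 packets per cube land on the 8741 ns pipelined total used to derive τ.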
+12 -9
@@ -3,7 +3,8 @@
Parametrized over (buffer_kind, n_elem). Each case runs the standard Parametrized over (buffer_kind, n_elem). Each case runs the standard
config-driven allreduce app and writes a JSON row to a shared staging config-driven allreduce app and writes a JSON row to a shared staging
dir; the conftest sessionfinish hook (added in Phase 1) aggregates dir; the conftest sessionfinish hook (added in Phase 1) aggregates
rows into ``docs/diagrams/allreduce_latency_plots/buffer_kind_sweep.png``. rows into ``docs/diagrams/allreduce_latency_plots/
AllReduce_LRAB_2Dtorus_6SiP_2x3_with_TCM_SRAM_HBM.png``.
Pre-Phase-2: the three buffer-kind lines overlap exactly because slot Pre-Phase-2: the three buffer-kind lines overlap exactly because slot
access is latency-free today. Post-Phase-2 they spread out (tcm access is latency-free today. Post-Phase-2 they spread out (tcm
@@ -36,6 +37,8 @@ _ELEM_BYTES_F16 = 2
_OUT_DIR = (Path(__file__).parent.parent / "docs" / "diagrams" _OUT_DIR = (Path(__file__).parent.parent / "docs" / "diagrams"
/ "allreduce_latency_plots") / "allreduce_latency_plots")
_ROWS_DIR = _OUT_DIR / "_buffer_kind_rows" _ROWS_DIR = _OUT_DIR / "_buffer_kind_rows"
# Descriptive output stem (shared by the .png and .csv).
_OUT_STEM = "AllReduce_LRAB_2Dtorus_6SiP_2x3_with_TCM_SRAM_HBM"
def _bk_params(): def _bk_params():
@@ -55,7 +58,7 @@ def test_buffer_kind_allreduce_one(tmp_path, buffer_kind, n_elem):
sub, sub,
sip_topology="torus_2d", sip_topology="torus_2d",
n_sips=6, n_sips=6,
algorithm="intercube_allreduce", algorithm="lrab_hierarchical_allreduce",
sip_w=3, sip_h=2, sip_w=3, sip_h=2,
n_elem_override=n_elem, n_elem_override=n_elem,
) )
@@ -64,7 +67,7 @@ def test_buffer_kind_allreduce_one(tmp_path, buffer_kind, n_elem):
ccl_cfg = yaml.safe_load(f) ccl_cfg = yaml.safe_load(f)
ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind
ccl_cfg.setdefault("algorithms", {}).setdefault( ccl_cfg.setdefault("algorithms", {}).setdefault(
"intercube_allreduce", {}, "lrab_hierarchical_allreduce", {},
)["buffer_kind"] = buffer_kind )["buffer_kind"] = buffer_kind
with open(ccl_path, "w") as f: with open(ccl_path, "w") as f:
yaml.dump(ccl_cfg, f, default_flow_style=False) yaml.dump(ccl_cfg, f, default_flow_style=False)
@@ -81,7 +84,7 @@ def test_buffer_kind_allreduce_one(tmp_path, buffer_kind, n_elem):
) as ctx: ) as ctx:
result = run_allreduce( result = run_allreduce(
ctx, engine, spec, ctx, engine, spec,
algorithm="intercube_allreduce", ccl_yaml=ccl_path, algorithm="lrab_hierarchical_allreduce", ccl_yaml=ccl_path,
) )
assert result["ok_cubes"] > 0 assert result["ok_cubes"] > 0
@@ -108,7 +111,7 @@ def test_buffer_kind_allreduce_one(tmp_path, buffer_kind, n_elem):
def aggregate_buffer_kind_plot() -> bool: def aggregate_buffer_kind_plot() -> bool:
"""Read per-config rows and emit buffer_kind_sweep.png + CSV. """Read per-config rows and emit the descriptive .png + .csv (_OUT_STEM).
Called from conftest.pytest_sessionfinish (controller-only). Called from conftest.pytest_sessionfinish (controller-only).
Returns True if rows were aggregated. Returns True if rows were aggregated.
@@ -141,7 +144,7 @@ def aggregate_buffer_kind_plot() -> bool:
_bytes_fmt = FuncFormatter(_fmt_bytes) _bytes_fmt = FuncFormatter(_fmt_bytes)
_OUT_DIR.mkdir(parents=True, exist_ok=True) _OUT_DIR.mkdir(parents=True, exist_ok=True)
with open(_OUT_DIR / "buffer_kind_sweep.csv", "w", with open(_OUT_DIR / f"{_OUT_STEM}.csv", "w",
newline="", encoding="utf-8") as f: newline="", encoding="utf-8") as f:
w = csv.DictWriter(f, fieldnames=[ w = csv.DictWriter(f, fieldnames=[
"buffer_kind", "sip_topology", "n_sips", "n_elem", "buffer_kind", "sip_topology", "n_sips", "n_elem",
@@ -172,13 +175,13 @@ def aggregate_buffer_kind_plot() -> bool:
ax.set_xlabel("Bytes per PE (log scale)") ax.set_xlabel("Bytes per PE (log scale)")
ax.set_ylabel("Time (ns)") ax.set_ylabel("Time (ns)")
ax.set_title( ax.set_title(
"Allreduce torus_2d (6 SIPs, 3×2) — IPCQ slot memory tier" "AllReduce_LRAB_2Dtorus_6SiP(2x3) — IPCQ memory (SRAM, TCM, HBM)"
) )
ax.grid(True, alpha=0.3) ax.grid(True, alpha=0.3)
ax.legend() ax.legend()
ax.xaxis.set_major_formatter(_bytes_fmt) ax.xaxis.set_major_formatter(_bytes_fmt)
fig.tight_layout() fig.tight_layout()
fig.savefig(_OUT_DIR / "buffer_kind_sweep.png", dpi=130) fig.savefig(_OUT_DIR / f"{_OUT_STEM}.png", dpi=130)
plt.close(fig) plt.close(fig)
for p in row_files: for p in row_files:
@@ -191,6 +194,6 @@ def aggregate_buffer_kind_plot() -> bool:
except OSError: except OSError:
pass pass
print(f"\nWrote {_OUT_DIR / 'buffer_kind_sweep.png'} " print(f"\nWrote {_OUT_DIR / f'{_OUT_STEM}.png'} "
f"from {len(records)} rows") f"from {len(records)} rows")
return True return True
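Review note: the buffer-kind test patches the loaded ccl config with nested `setdefault` calls keyed by the algorithm name, which is why the rename has to be applied to the key there as well. The sketch below isolates that pattern on plain dicts (YAML load/dump omitted). The helper name `override_buffer_kind` and the sample config values are hypothetical.

```python
def override_buffer_kind(ccl_cfg, algorithm, buffer_kind):
    """Set buffer_kind in defaults and in the per-algorithm section, in place."""
    ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind
    # setdefault creates the algorithm entry if it is missing, so a stale
    # key would silently add a second, mostly empty section.
    algo_section = ccl_cfg.setdefault("algorithms", {}).setdefault(algorithm, {})
    algo_section["buffer_kind"] = buffer_kind
    return ccl_cfg


cfg = {
    "defaults": {"algorithm": "lrab_hierarchical_allreduce"},
    "algorithms": {
        "lrab_hierarchical_allreduce": {
            "module": "kernbench.ccl.algorithms.lrab_hierarchical_allreduce",
        },
    },
}

patched = override_buffer_kind(cfg, "lrab_hierarchical_allreduce", "sram")
stale = override_buffer_kind(
    {"algorithms": {"lrab_hierarchical_allreduce": {}}},
    "intercube_allreduce",
    "sram",
)
```

With the old key, `stale["algorithms"]` ends up with two entries and the real algorithm section never receives the override, so the test would run with the default buffer kind without raising an error.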
+28 -77
@@ -189,15 +189,15 @@ TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
CONFIGS = [ CONFIGS = [
pytest.param( pytest.param(
"intercube_allreduce", "ring_1d", 6, None, None, "lrab_hierarchical_allreduce", "ring_1d", 6, None, None,
id="ring_6sip", id="ring_6sip",
), ),
pytest.param( pytest.param(
"intercube_allreduce", "torus_2d", 6, 2, 3, "lrab_hierarchical_allreduce", "torus_2d", 6, 2, 3,
id="torus_6sip_2x3", id="torus_6sip_2x3",
), ),
pytest.param( pytest.param(
"intercube_allreduce", "mesh_2d_no_wrap", 6, 2, 3, "lrab_hierarchical_allreduce", "mesh_2d_no_wrap", 6, 2, 3,
id="mesh_6sip_2x3", id="mesh_6sip_2x3",
), ),
] ]
@@ -280,9 +280,9 @@ _SWEEP_N_ELEM = [
_ELEM_BYTES_F16 = 2 _ELEM_BYTES_F16 = 2
_SWEEP_TOPOLOGIES = [ _SWEEP_TOPOLOGIES = [
("intercube_allreduce", "ring_1d", 6, None, None), ("lrab_hierarchical_allreduce", "ring_1d", 6, None, None),
("intercube_allreduce", "torus_2d", 6, 2, 3), ("lrab_hierarchical_allreduce", "torus_2d", 6, 2, 3),
("intercube_allreduce", "mesh_2d_no_wrap", 6, 2, 3), ("lrab_hierarchical_allreduce", "mesh_2d_no_wrap", 6, 2, 3),
] ]
# Shared on-disk staging dir for parametrized sweep rows. Each # Shared on-disk staging dir for parametrized sweep rows. Each
@@ -440,10 +440,22 @@ def _aggregate_sweep_plots() -> bool:
continue continue
xs = [r["bytes_per_pe"] for r in rs] xs = [r["bytes_per_pe"] for r in rs]
ys = [r["latency_ns"] for r in rs] ys = [r["latency_ns"] for r in rs]
title = ( _per_topo_titles = {
f"Allreduce latency — {topo_name} " "ring_1d": "AllReduce_LRAB_Ring1D_6SiP(1x6)",
f"(n_sips={rs[0]['n_sips']})" "torus_2d": "AllReduce_LRAB_2Dtorus_6SiP(2x3)",
"mesh_2d_no_wrap": "AllReduce_LRAB_2DMesh_6SiP(2x3)",
}
# Descriptive output filenames (parens → underscores for
# markdown/URL safety; topo key stays the summary.csv value).
_per_topo_files = {
"ring_1d": "AllReduce_LRAB_Ring1D_6SiP_1x6",
"torus_2d": "AllReduce_LRAB_2Dtorus_6SiP_2x3",
"mesh_2d_no_wrap": "AllReduce_LRAB_2DMesh_6SiP_2x3",
}
title = _per_topo_titles.get(
topo_name, f"Allreduce latency — {topo_name}"
) )
out_stem = _per_topo_files.get(topo_name, topo_name)
fig, ax = plt.subplots(figsize=(8, 5)) fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(xs, ys, marker="o", color="tab:blue") ax.plot(xs, ys, marker="o", color="tab:blue")
ax.set_xscale("log", base=2) ax.set_xscale("log", base=2)
@@ -453,75 +465,14 @@ def _aggregate_sweep_plots() -> bool:
ax.grid(True, alpha=0.3) ax.grid(True, alpha=0.3)
ax.xaxis.set_major_formatter(_bytes_fmt) ax.xaxis.set_major_formatter(_bytes_fmt)
fig.tight_layout() fig.tight_layout()
fig.savefig(_SWEEP_OUT_DIR / f"{topo_name}.png", dpi=120) fig.savefig(_SWEEP_OUT_DIR / f"{out_stem}.png", dpi=120)
plt.close(fig) plt.close(fig)
colors = {"ring_1d": "tab:blue", "torus_2d": "tab:orange", # Combined overview.png is no longer emitted — the broken-y-axis
"mesh_2d_no_wrap": "tab:green"} # comparison (scripts/emit_overview_with_external_ref.py →
# comparison_mesh_vs_ring_vs_2DTorus_vs_theoretical_vs_fsim.png)
# ── Hand-derived theoretical model for torus_2d (6 SIPs) ── # supersedes it. Per-topology plots above and summary.csv are still
# Critical-path analysis (per packet, packet = 128 B at NoC): # produced.
# local intra-SIP reduce + broadcast = 8 hops × 57 ns = 456 ns
# global X-direction reduce = 5 UCIe + 1 UAL = 445 ns
# global Y-direction reduce = 5 UCIe + 1 UAL = 445 ns
# per-packet startup latency = 456 + 445 + 445 = 1346 ns
# Packet count is PER CUBE (8 PEs/cube cooperate on the cube tile).
# At 6144 packets/cube the pipelined total is 8741 ns, so the
# bottleneck-stage interval τ = (8741 - 1346) / (6144 - 1) ≈ 1.204 ns.
# T_theoretical(N) = 1346 + (N - 1) × τ
# where N = ceil((bytes_per_pe × 8) / 128) = ceil(bytes_per_pe / 16)
NOC_PACKET_BYTES = 128
PES_PER_CUBE = 8
T_STARTUP_NS = 1346.0
TAU_NS = (8741.0 - 1346.0) / (6144 - 1) # ≈ 1.2038 ns/packet
def _theoretical_torus_2d_ns(bytes_per_pe: int) -> float:
bytes_per_cube = int(bytes_per_pe) * PES_PER_CUBE
n_packets = max(1, -(-bytes_per_cube // NOC_PACKET_BYTES)) # ceil
return T_STARTUP_NS + (n_packets - 1) * TAU_NS
fig, ax = plt.subplots(figsize=(9, 6))
for topo_name in topologies:
rs = sorted(
[r for r in records if r["sip_topology"] == topo_name],
key=lambda r: r["bytes_per_pe"],
)
if not rs:
continue
ax.plot(
[r["bytes_per_pe"] for r in rs],
[r["latency_ns"] for r in rs],
marker="o",
label=f"{topo_name} (n_sips={rs[0]['n_sips']})",
color=colors.get(topo_name),
)
# Theoretical torus_2d curve across all payload sizes.
torus_rs = sorted(
[r for r in records if r["sip_topology"] == "torus_2d"],
key=lambda r: r["bytes_per_pe"],
)
if torus_rs:
xs_th = [r["bytes_per_pe"] for r in torus_rs]
ys_th = [_theoretical_torus_2d_ns(r["bytes_per_pe"]) for r in torus_rs]
ax.plot(
xs_th, ys_th,
color="tab:red", linestyle="--", linewidth=1.6, marker="x",
label="theoretical torus_2d (6 SIPs)",
)
ax.set_xscale("log", base=2)
ax.set_xlabel("Bytes per PE (log scale)")
ax.set_ylabel("Time (ns)")
ax.set_title("Multi-device allreduce latency by topology")
ax.grid(True, alpha=0.3)
ax.set_xlim(left=min(r["bytes_per_pe"] for r in records) / 2,
right=max(r["bytes_per_pe"] for r in records) * 1.5)
ax.legend()
ax.xaxis.set_major_formatter(_bytes_fmt)
fig.tight_layout()
fig.savefig(_SWEEP_OUT_DIR / "overview.png", dpi=120)
plt.close(fig)
# Cleanup row staging dir so a partial future run doesn't pick up # Cleanup row staging dir so a partial future run doesn't pick up
# stale rows. # stale rows.
@@ -535,7 +486,7 @@ def _aggregate_sweep_plots() -> bool:
except OSError: except OSError:
pass pass
print(f"\nWrote {_SWEEP_OUT_DIR / 'overview.png'} " print(f"\nWrote per-topology plots + summary.csv to {_SWEEP_OUT_DIR} "
f"from {len(records)} rows") f"from {len(records)} rows")
return True return True
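Review note: the hand-derived torus_2d model deleted in this hunk can be restated as a standalone sketch for anyone who needs to reproduce the dashed "theoretical" curve outside the sweep script. The constants are copied from the removed comment block. The public function name and the `__main__` driver are illustrative and not part of the repository.

```python
"""Standalone restatement of the removed torus_2d latency model (sketch)."""

NOC_PACKET_BYTES = 128   # packet size at the NoC, from the removed comment
PES_PER_CUBE = 8         # PEs cooperating on one cube tile

# Per-packet startup latency: local reduce + broadcast (8 hops x 57 ns)
# plus the X-direction and Y-direction global reduces (445 ns each).
T_STARTUP_NS = 8 * 57.0 + 445.0 + 445.0          # 1346 ns

# Bottleneck-stage interval, calibrated on the 6144-packet point (8741 ns).
TAU_NS = (8741.0 - T_STARTUP_NS) / (6144 - 1)    # about 1.2038 ns/packet


def theoretical_torus_2d_ns(bytes_per_pe: int) -> float:
    """Pipelined latency estimate: startup + (N - 1) * tau."""
    bytes_per_cube = int(bytes_per_pe) * PES_PER_CUBE
    # Integer ceiling division; at least one packet is always sent.
    n_packets = max(1, -(-bytes_per_cube // NOC_PACKET_BYTES))
    return T_STARTUP_NS + (n_packets - 1) * TAU_NS


if __name__ == "__main__":
    for nbytes in (16, 1024, 98304):
        print(nbytes, round(theoretical_torus_2d_ns(nbytes), 1))
```

Under this model 98304 bytes per PE maps to exactly 6144 packets per cube, so the estimate at that size returns the 8741 ns calibration value.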
@@ -25,7 +25,7 @@ N_ELEM = 8
def _write_ccl_yaml(tmp_path) -> str: def _write_ccl_yaml(tmp_path) -> str:
body = textwrap.dedent("""\ body = textwrap.dedent("""\
defaults: defaults:
algorithm: intercube_allreduce algorithm: lrab_hierarchical_allreduce
buffer_kind: tcm buffer_kind: tcm
backpressure: sleep backpressure: sleep
n_slots: 4 n_slots: 4
@@ -34,8 +34,8 @@ def _write_ccl_yaml(tmp_path) -> str:
ipcq_credit_size_bytes: 16 ipcq_credit_size_bytes: 16
algorithms: algorithms:
intercube_allreduce: lrab_hierarchical_allreduce:
module: kernbench.ccl.algorithms.intercube_allreduce module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
topology: none topology: none
buffer_kind: tcm buffer_kind: tcm
n_elem: 8 n_elem: 8
@@ -80,11 +80,11 @@ def _worker(rank: int, n_sips: int, torch) -> None:
) )
if rank == 0: if rank == 0:
print(f"\n intercube_allreduce (ws={n_sips}): " print(f"\n lrab_hierarchical_allreduce (ws={n_sips}): "
f"{n_sips * N_CUBES} OK") f"{n_sips * N_CUBES} OK")
def test_distributed_intercube_allreduce(tmp_path, monkeypatch): def test_distributed_lrab_hierarchical_allreduce(tmp_path, monkeypatch):
"""Full distributed path: init_process_group → mp.spawn → all_reduce.""" """Full distributed path: init_process_group → mp.spawn → all_reduce."""
from kernbench.runtime_api.context import RuntimeContext from kernbench.runtime_api.context import RuntimeContext
from kernbench.runtime_api.types import DeviceSelector from kernbench.runtime_api.types import DeviceSelector
+6 -6
@@ -1,7 +1,7 @@
"""Phase 1 test for moving the intercube_allreduce root cube from the """Phase 1 test for moving the lrab_hierarchical_allreduce root cube from the
bottom-right corner (3,3) to the geometric center (2,2). bottom-right corner (3,3) to the geometric center (2,2).
Today's algorithm (intercube_allreduce.py) hardcodes Today's algorithm (lrab_hierarchical_allreduce.py) hardcodes
``root_cube = (cube_h-1) * cube_w + (cube_w-1)`` (= cube 15 in 4×4). ``root_cube = (cube_h-1) * cube_w + (cube_w-1)`` (= cube 15 in 4×4).
The intra-SIP critical path for one allreduce is therefore:: The intra-SIP critical path for one allreduce is therefore::
@@ -55,7 +55,7 @@ def _run_torus_96kb(tmp_path: Path) -> float:
sub, sub,
sip_topology="torus_2d", sip_topology="torus_2d",
n_sips=6, n_sips=6,
algorithm="intercube_allreduce", algorithm="lrab_hierarchical_allreduce",
sip_w=3, sip_h=2, sip_w=3, sip_h=2,
n_elem_override=49152, # 49152 × 2 = 96 KB / slot n_elem_override=49152, # 49152 × 2 = 96 KB / slot
) )
@@ -70,7 +70,7 @@ def _run_torus_96kb(tmp_path: Path) -> float:
) as ctx: ) as ctx:
result = run_allreduce( result = run_allreduce(
ctx, engine, spec, ctx, engine, spec,
algorithm="intercube_allreduce", ccl_yaml=ccl_path, algorithm="lrab_hierarchical_allreduce", ccl_yaml=ccl_path,
) )
assert result["ok_cubes"] > 0 assert result["ok_cubes"] > 0
pe_exec_vals = [ pe_exec_vals = [
@@ -121,7 +121,7 @@ def test_correctness_preserved(tmp_path):
sub, sub,
sip_topology="torus_2d", sip_topology="torus_2d",
n_sips=6, n_sips=6,
algorithm="intercube_allreduce", algorithm="lrab_hierarchical_allreduce",
sip_w=3, sip_h=2, sip_w=3, sip_h=2,
n_elem_override=128, # tiny payload to keep this fast n_elem_override=128, # tiny payload to keep this fast
) )
@@ -136,7 +136,7 @@ def test_correctness_preserved(tmp_path):
) as ctx: ) as ctx:
result = run_allreduce( result = run_allreduce(
ctx, engine, spec, ctx, engine, spec,
algorithm="intercube_allreduce", ccl_yaml=ccl_path, algorithm="lrab_hierarchical_allreduce", ccl_yaml=ccl_path,
) )
n_cubes = 6 * 16 # 6 SIPs × 16 cubes/SIP n_cubes = 6 * 16 # 6 SIPs × 16 cubes/SIP
assert result["ok_cubes"] == n_cubes, ( assert result["ok_cubes"] == n_cubes, (
+1 -1
@@ -28,7 +28,7 @@ def _engine_and_spec():
def _merged_cfg(): def _merged_cfg():
cfg = load_ccl_config() cfg = load_ccl_config()
return resolve_algorithm_config(cfg, name="intercube_allreduce") return resolve_algorithm_config(cfg, name="lrab_hierarchical_allreduce")
class TestConfigureSfrNeighborTables: class TestConfigureSfrNeighborTables:
+3 -3
@@ -81,7 +81,7 @@ def _run_torus_allreduce(
sub, sub,
sip_topology="torus_2d", sip_topology="torus_2d",
n_sips=6, n_sips=6,
algorithm="intercube_allreduce", algorithm="lrab_hierarchical_allreduce",
sip_w=3, sip_h=2, sip_w=3, sip_h=2,
n_elem_override=n_elem, n_elem_override=n_elem,
) )
@@ -92,7 +92,7 @@ def _run_torus_allreduce(
ccl_cfg = yaml.safe_load(f) ccl_cfg = yaml.safe_load(f)
ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind
ccl_cfg.setdefault("algorithms", {}).setdefault( ccl_cfg.setdefault("algorithms", {}).setdefault(
"intercube_allreduce", {}, "lrab_hierarchical_allreduce", {},
)["buffer_kind"] = buffer_kind )["buffer_kind"] = buffer_kind
with open(ccl_path, "w") as f: with open(ccl_path, "w") as f:
yaml.dump(ccl_cfg, f, default_flow_style=False) yaml.dump(ccl_cfg, f, default_flow_style=False)
@@ -109,7 +109,7 @@ def _run_torus_allreduce(
) as ctx: ) as ctx:
result = run_allreduce( result = run_allreduce(
ctx, engine, spec, ctx, engine, spec,
algorithm="intercube_allreduce", ccl_yaml=ccl_path, algorithm="lrab_hierarchical_allreduce", ccl_yaml=ccl_path,
) )
assert result["ok_cubes"] > 0, "allreduce did not validate" assert result["ok_cubes"] > 0, "allreduce did not validate"
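Review note: the `setdefault` chain in this hunk creates the algorithm entry when it is missing, so a config file that still carried the old key would not raise here; the test would silently add an empty entry under the new name. A minimal sketch of the same override on plain dicts follows. The helper name `override_buffer_kind` and the `None` guard are assumptions added for illustration (`yaml.safe_load` returns `None` for an empty file).

```python
import copy


def override_buffer_kind(cfg, algorithm, buffer_kind):
    """Return a copy of cfg with buffer_kind forced in both places.

    Mirrors the nested setdefault pattern used by the tests; the
    helper itself is hypothetical and not part of kernbench.
    """
    out = copy.deepcopy(cfg) if cfg else {}
    # Global default.
    out.setdefault("defaults", {})["buffer_kind"] = buffer_kind
    # Per-algorithm override; creates the entry if it does not exist.
    out.setdefault("algorithms", {}).setdefault(algorithm, {})[
        "buffer_kind"
    ] = buffer_kind
    return out


if __name__ == "__main__":
    base = {"defaults": {"algorithm": "lrab_hierarchical_allreduce"}}
    print(override_buffer_kind(base, "lrab_hierarchical_allreduce", "tcm"))
```

Because the helper works on a deep copy, the input mapping is left unchanged, which makes the created-versus-updated distinction easy to assert in a test.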
+3 -3
@@ -68,7 +68,7 @@ def _run_allreduce_with_buffer_kind(
sub, sub,
sip_topology="torus_2d", sip_topology="torus_2d",
n_sips=6, n_sips=6,
algorithm="intercube_allreduce", algorithm="lrab_hierarchical_allreduce",
sip_w=3, sip_h=2, sip_w=3, sip_h=2,
n_elem_override=n_elem, n_elem_override=n_elem,
) )
@@ -77,7 +77,7 @@ def _run_allreduce_with_buffer_kind(
ccl_cfg = yaml.safe_load(f) ccl_cfg = yaml.safe_load(f)
ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind
ccl_cfg.setdefault("algorithms", {}).setdefault( ccl_cfg.setdefault("algorithms", {}).setdefault(
"intercube_allreduce", {}, "lrab_hierarchical_allreduce", {},
)["buffer_kind"] = buffer_kind )["buffer_kind"] = buffer_kind
with open(ccl_path, "w") as f: with open(ccl_path, "w") as f:
yaml.dump(ccl_cfg, f, default_flow_style=False) yaml.dump(ccl_cfg, f, default_flow_style=False)
@@ -94,7 +94,7 @@ def _run_allreduce_with_buffer_kind(
) as ctx: ) as ctx:
result = run_allreduce( result = run_allreduce(
ctx, engine, spec, ctx, engine, spec,
algorithm="intercube_allreduce", ccl_yaml=ccl_path, algorithm="lrab_hierarchical_allreduce", ccl_yaml=ccl_path,
) )
assert result["ok_cubes"] > 0, "allreduce did not validate" assert result["ok_cubes"] > 0, "allreduce did not validate"
+1 -1
@@ -472,7 +472,7 @@ def _run_ipcq():
dst_sip, dst_cube, dst_pe = DST dst_sip, dst_cube, dst_pe = DST
cfg = load_ccl_config() cfg = load_ccl_config()
merged = resolve_algorithm_config(cfg, name="intercube_allreduce") merged = resolve_algorithm_config(cfg, name="lrab_hierarchical_allreduce")
merged["slot_size"] = max(int(merged.get("slot_size", 4096)), NBYTES) merged["slot_size"] = max(int(merged.get("slot_size", 4096)), NBYTES)
with RuntimeContext( with RuntimeContext(
+9 -5
@@ -56,13 +56,17 @@ class Hop:
HOPS = [ HOPS = [
Hop("h1_intra_horizontal", "Intra-cube horizontal (pe0 to pe1)", Hop("latency_intracube_PE0_to_PE1_horizontal",
"Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal)",
(0, 0, 0), (0, 0, 1), "intra_E", "intra_W", True), (0, 0, 0), (0, 0, 1), "intra_E", "intra_W", True),
Hop("h2_intra_vertical", "Intra-cube vertical (pe0 to pe4)", Hop("latency_intracube_PE0_to_PE4_vertical",
"Intra-cube PE-to-PE latency: PE0 → PE4 (vertical)",
(0, 0, 0), (0, 0, 4), "intra_S", "intra_N", True), (0, 0, 0), (0, 0, 4), "intra_S", "intra_N", True),
Hop("h3_inter_cube_horizontal", "Inter-cube horizontal (cube0 to cube1)", Hop("latency_intercube_C0PE0_to_C1PE0_horizontal",
"Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal)",
(0, 0, 0), (0, 1, 0), "E", "W", True), (0, 0, 0), (0, 1, 0), "E", "W", True),
Hop("h4_inter_cube_vertical", "Inter-cube vertical (cube0 to cube4)", Hop("latency_intercube_C0PE0_to_C4PE0_vertical",
"Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical)",
(0, 0, 0), (0, 4, 0), "S", "N", True), (0, 0, 0), (0, 4, 0), "S", "N", True),
] ]
@@ -80,7 +84,7 @@ def _measure_ipcq(hop: Hop, nbytes: int) -> float:
engine, spec = _make_engine() engine, spec = _make_engine()
cfg = load_ccl_config() cfg = load_ccl_config()
merged = resolve_algorithm_config(cfg, name="intercube_allreduce") merged = resolve_algorithm_config(cfg, name="lrab_hierarchical_allreduce")
merged["slot_size"] = max(int(merged.get("slot_size", 4096)), nbytes) merged["slot_size"] = max(int(merged.get("slot_size", 4096)), nbytes)
n_elem = nbytes // ELEM_BYTES n_elem = nbytes // ELEM_BYTES