Intercube allreduce: center root + bidirectional reduce

Move the algorithmic root cube from the corner (cube_w-1,
cube_h-1) to the geometric center (cube_w//2, cube_h//2) and
have each phase converge bidirectionally so the intra-SIP
critical path drops from ~12 hops to ~8 hops on a 4×4 mesh
(left half W→E + right half E→W in row reduce; top half N→S +
bottom half S→N in col reduce; mirrored on broadcast).

Result on torus_2d 6 SIPs at 96 KB / PE on TCM:
  before (corner root)  : 22.0 µs
  after  (center root)  : 17.2 µs   (−22%)

Same shape on ring_1d (−7%) and mesh_2d_no_wrap (−12%); also
holds across SRAM and HBM (~−20% each).

Phase 1 test (test_intercube_root_center.py) asserts the
torus_2d 96 KB latency drops below 20.5 µs and that all 96
cubes still validate (correctness preserved).

Plot updates:
- overview.png: replace constant 10.6 µs theoretical line with
  user-supplied hand-derived curve (per-cube packet count =
  bytes_per_pe × 8 PEs ÷ 128 B; 1346 ns startup + 1.20 ns/pkt).
- All summary.csv numbers and per-topology PNGs regenerated.
- pe2pe_latency_plots and ipcq diagram emitter PNGs refreshed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-27 21:28:58 -07:00
parent 84a1325e5c
commit 1c5752a9ec
16 changed files with 324 additions and 157 deletions
@@ -1,37 +1,37 @@
algorithm,sip_topology,n_sips,n_elem,bytes_per_pe,bytes_per_sip,latency_ns
intercube_allreduce,mesh_2d_no_wrap,6,8,16,256,3508.4249999999993
intercube_allreduce,mesh_2d_no_wrap,6,32,64,1024,3515.55
intercube_allreduce,mesh_2d_no_wrap,6,64,128,2048,3525.0499999999975
intercube_allreduce,mesh_2d_no_wrap,6,128,256,4096,3544.049999999992
intercube_allreduce,mesh_2d_no_wrap,6,512,1024,16384,3667.049999999992
intercube_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3837.049999999992
intercube_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,4177.049999999992
intercube_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,4857.049999999959
intercube_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,6217.049999999945
intercube_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,8937.049999999937
intercube_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,14377.049999999872
intercube_allreduce,mesh_2d_no_wrap,6,49152,98304,1572864,19817.049999999872
intercube_allreduce,ring_1d,6,8,16,256,3073.1299999999937
intercube_allreduce,ring_1d,6,32,64,1024,3079.8799999999947
intercube_allreduce,ring_1d,6,64,128,2048,3088.879999999992
intercube_allreduce,ring_1d,6,128,256,4096,3106.8799999999865
intercube_allreduce,ring_1d,6,512,1024,16384,3225.8799999999865
intercube_allreduce,ring_1d,6,1024,2048,32768,3391.8799999999865
intercube_allreduce,ring_1d,6,2048,4096,65536,3723.8799999999865
intercube_allreduce,ring_1d,6,4096,8192,131072,4387.879999999965
intercube_allreduce,ring_1d,6,8192,16384,262144,5715.879999999957
intercube_allreduce,ring_1d,6,16384,32768,524288,8371.879999999932
intercube_allreduce,ring_1d,6,32768,65536,1048576,13683.879999999903
intercube_allreduce,ring_1d,6,49152,98304,1572864,18995.879999999917
intercube_allreduce,torus_2d,6,8,16,256,2190.4799999999923
intercube_allreduce,torus_2d,6,32,64,1024,2196.479999999993
intercube_allreduce,torus_2d,6,64,128,2048,2204.4799999999905
intercube_allreduce,torus_2d,6,128,256,4096,2220.479999999985
intercube_allreduce,torus_2d,6,512,1024,16384,2325.479999999985
intercube_allreduce,torus_2d,6,1024,2048,32768,2471.479999999985
intercube_allreduce,torus_2d,6,2048,4096,65536,2763.479999999985
intercube_allreduce,torus_2d,6,4096,8192,131072,3347.4799999999777
intercube_allreduce,torus_2d,6,8192,16384,262144,4515.4799999999705
intercube_allreduce,torus_2d,6,16384,32768,524288,6851.479999999952
intercube_allreduce,torus_2d,6,32768,65536,1048576,11523.479999999923
intercube_allreduce,torus_2d,6,49152,98304,1572864,16195.479999999952
intercube_allreduce,mesh_2d_no_wrap,6,8,16,256,2626.302499999998
intercube_allreduce,mesh_2d_no_wrap,6,32,64,1024,2634.7399999999952
intercube_allreduce,mesh_2d_no_wrap,6,64,128,2048,2645.9899999999925
intercube_allreduce,mesh_2d_no_wrap,6,128,256,4096,2668.489999999987
intercube_allreduce,mesh_2d_no_wrap,6,512,1024,16384,2812.489999999987
intercube_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3010.489999999987
intercube_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,3406.489999999987
intercube_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,4198.489999999965
intercube_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,5782.489999999969
intercube_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,8950.489999999925
intercube_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,15286.48999999986
intercube_allreduce,mesh_2d_no_wrap,6,49152,98304,1572864,21622.489999999932
intercube_allreduce,ring_1d,6,8,16,256,2302.9849999999933
intercube_allreduce,ring_1d,6,32,64,1024,2310.8599999999906
intercube_allreduce,ring_1d,6,64,128,2048,2321.359999999988
intercube_allreduce,ring_1d,6,128,256,4096,2342.3599999999824
intercube_allreduce,ring_1d,6,512,1024,16384,2479.3599999999824
intercube_allreduce,ring_1d,6,1024,2048,32768,2669.3599999999824
intercube_allreduce,ring_1d,6,2048,4096,65536,3049.3599999999824
intercube_allreduce,ring_1d,6,4096,8192,131072,3809.3599999999715
intercube_allreduce,ring_1d,6,8192,16384,262144,5329.359999999979
intercube_allreduce,ring_1d,6,16384,32768,524288,8369.35999999992
intercube_allreduce,ring_1d,6,32768,65536,1048576,14449.359999999899
intercube_allreduce,ring_1d,6,49152,98304,1572864,20529.35999999997
intercube_allreduce,torus_2d,6,8,16,256,1644.2899999999936
intercube_allreduce,torus_2d,6,32,64,1024,1651.0399999999909
intercube_allreduce,torus_2d,6,64,128,2048,1660.0399999999881
intercube_allreduce,torus_2d,6,128,256,4096,1678.0399999999827
intercube_allreduce,torus_2d,6,512,1024,16384,1795.0399999999827
intercube_allreduce,torus_2d,6,1024,2048,32768,1957.0399999999827
intercube_allreduce,torus_2d,6,2048,4096,65536,2281.0399999999827
intercube_allreduce,torus_2d,6,4096,8192,131072,2929.039999999979
intercube_allreduce,torus_2d,6,8192,16384,262144,4225.039999999986
intercube_allreduce,torus_2d,6,16384,32768,524288,6817.039999999943
intercube_allreduce,torus_2d,6,32768,65536,1048576,12001.03999999992
intercube_allreduce,torus_2d,6,49152,98304,1572864,17185.039999999994
1 algorithm sip_topology n_sips n_elem bytes_per_pe bytes_per_sip latency_ns
2 intercube_allreduce mesh_2d_no_wrap 6 8 16 256 3508.4249999999993 2626.302499999998
3 intercube_allreduce mesh_2d_no_wrap 6 32 64 1024 3515.55 2634.7399999999952
4 intercube_allreduce mesh_2d_no_wrap 6 64 128 2048 3525.0499999999975 2645.9899999999925
5 intercube_allreduce mesh_2d_no_wrap 6 128 256 4096 3544.049999999992 2668.489999999987
6 intercube_allreduce mesh_2d_no_wrap 6 512 1024 16384 3667.049999999992 2812.489999999987
7 intercube_allreduce mesh_2d_no_wrap 6 1024 2048 32768 3837.049999999992 3010.489999999987
8 intercube_allreduce mesh_2d_no_wrap 6 2048 4096 65536 4177.049999999992 3406.489999999987
9 intercube_allreduce mesh_2d_no_wrap 6 4096 8192 131072 4857.049999999959 4198.489999999965
10 intercube_allreduce mesh_2d_no_wrap 6 8192 16384 262144 6217.049999999945 5782.489999999969
11 intercube_allreduce mesh_2d_no_wrap 6 16384 32768 524288 8937.049999999937 8950.489999999925
12 intercube_allreduce mesh_2d_no_wrap 6 32768 65536 1048576 14377.049999999872 15286.48999999986
13 intercube_allreduce mesh_2d_no_wrap 6 49152 98304 1572864 19817.049999999872 21622.489999999932
14 intercube_allreduce ring_1d 6 8 16 256 3073.1299999999937 2302.9849999999933
15 intercube_allreduce ring_1d 6 32 64 1024 3079.8799999999947 2310.8599999999906
16 intercube_allreduce ring_1d 6 64 128 2048 3088.879999999992 2321.359999999988
17 intercube_allreduce ring_1d 6 128 256 4096 3106.8799999999865 2342.3599999999824
18 intercube_allreduce ring_1d 6 512 1024 16384 3225.8799999999865 2479.3599999999824
19 intercube_allreduce ring_1d 6 1024 2048 32768 3391.8799999999865 2669.3599999999824
20 intercube_allreduce ring_1d 6 2048 4096 65536 3723.8799999999865 3049.3599999999824
21 intercube_allreduce ring_1d 6 4096 8192 131072 4387.879999999965 3809.3599999999715
22 intercube_allreduce ring_1d 6 8192 16384 262144 5715.879999999957 5329.359999999979
23 intercube_allreduce ring_1d 6 16384 32768 524288 8371.879999999932 8369.35999999992
24 intercube_allreduce ring_1d 6 32768 65536 1048576 13683.879999999903 14449.359999999899
25 intercube_allreduce ring_1d 6 49152 98304 1572864 18995.879999999917 20529.35999999997
26 intercube_allreduce torus_2d 6 8 16 256 2190.4799999999923 1644.2899999999936
27 intercube_allreduce torus_2d 6 32 64 1024 2196.479999999993 1651.0399999999909
28 intercube_allreduce torus_2d 6 64 128 2048 2204.4799999999905 1660.0399999999881
29 intercube_allreduce torus_2d 6 128 256 4096 2220.479999999985 1678.0399999999827
30 intercube_allreduce torus_2d 6 512 1024 16384 2325.479999999985 1795.0399999999827
31 intercube_allreduce torus_2d 6 1024 2048 32768 2471.479999999985 1957.0399999999827
32 intercube_allreduce torus_2d 6 2048 4096 65536 2763.479999999985 2281.0399999999827
33 intercube_allreduce torus_2d 6 4096 8192 131072 3347.4799999999777 2929.039999999979
34 intercube_allreduce torus_2d 6 8192 16384 262144 4515.4799999999705 4225.039999999986
35 intercube_allreduce torus_2d 6 16384 32768 524288 6851.479999999952 6817.039999999943
36 intercube_allreduce torus_2d 6 32768 65536 1048576 11523.479999999923 12001.03999999992
37 intercube_allreduce torus_2d 6 49152 98304 1572864 16195.479999999952 17185.039999999994