1c5752a9ec
Move the algorithmic root cube from the corner (cube_w-1, cube_h-1) to the geometric center (cube_w//2, cube_h//2), and have each phase converge bidirectionally so the intra-SIP critical path drops from ~12 hops to ~8 hops on a 4×4 mesh:
- row reduce: left half W→E, right half E→W
- col reduce: top half N→S, bottom half S→N
- broadcast: mirror of the above

Result on torus_2d, 6 SIPs, 96 KB / PE on TCM:
  before (corner root): 22.0 µs
  after (center root):  17.2 µs (−22%)

Same shape on ring_1d (−7%) and mesh_2d_no_wrap (−12%); the improvement also holds on SRAM and HBM (~−20% each).

Phase 1 test (test_intercube_root_center.py) asserts that the torus_2d 96 KB latency drops below 20.5 µs and that all 96 cubes still validate (correctness preserved).

Plot updates:
- overview.png: replace the constant 10.6 µs theoretical line with a user-supplied, hand-derived curve (per-cube packet count = bytes_per_pe × 8 PEs ÷ 128 B; 1346 ns startup + 1.20 ns/pkt).
- All summary.csv numbers and per-topology PNGs regenerated.
- pe2pe_latency_plots and ipcq diagram emitter PNGs refreshed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
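The ~12 → ~8 hop figures can be reproduced with a simple model. The sketch below assumes each phase is a chain of single-hop forwards toward the root (or away from it, for broadcast) along one axis, so a phase's critical path is the distance from the farther edge to the root. The function names are hypothetical illustrations, not taken from the codebase.

```python
def phase_hops(extent: int, root: int) -> int:
    """Longest forwarding chain into `root` along one axis of `extent` cubes.

    Assumed model: cubes on both sides of the root forward partial results
    toward it in parallel (e.g. left half W->E, right half E->W), so the
    critical path is the distance from the farther edge to the root.
    """
    return max(root, extent - 1 - root)


def critical_path_hops(cube_w: int, cube_h: int, root_x: int, root_y: int) -> int:
    """Row reduce + col reduce, followed by a mirrored col/row broadcast."""
    reduce_hops = phase_hops(cube_w, root_x) + phase_hops(cube_h, root_y)
    broadcast_hops = reduce_hops  # broadcast mirrors the two reduce phases
    return reduce_hops + broadcast_hops


if __name__ == "__main__":
    w = h = 4
    corner = critical_path_hops(w, h, w - 1, h - 1)
    center = critical_path_hops(w, h, w // 2, h // 2)
    print(f"corner root: {corner} hops")  # corner root: 12 hops
    print(f"center root: {center} hops")  # center root: 8 hops
```

With the root in the corner, one side of each axis is empty, so every phase costs cube_w-1 = 3 hops; at the center the two halves run concurrently and the longer side is only 2 hops. On an even extent, cube_w//2 and cube_w//2 - 1 give the same result under this model.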
algorithm,sip_topology,n_sips,n_elem,bytes_per_pe,bytes_per_sip,latency_ns
intercube_allreduce,mesh_2d_no_wrap,6,8,16,256,2626.302499999998
intercube_allreduce,mesh_2d_no_wrap,6,32,64,1024,2634.7399999999952
intercube_allreduce,mesh_2d_no_wrap,6,64,128,2048,2645.9899999999925
intercube_allreduce,mesh_2d_no_wrap,6,128,256,4096,2668.489999999987
intercube_allreduce,mesh_2d_no_wrap,6,512,1024,16384,2812.489999999987
intercube_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3010.489999999987
intercube_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,3406.489999999987
intercube_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,4198.489999999965
intercube_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,5782.489999999969
intercube_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,8950.489999999925
intercube_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,15286.48999999986
intercube_allreduce,mesh_2d_no_wrap,6,49152,98304,1572864,21622.489999999932
intercube_allreduce,ring_1d,6,8,16,256,2302.9849999999933
intercube_allreduce,ring_1d,6,32,64,1024,2310.8599999999906
intercube_allreduce,ring_1d,6,64,128,2048,2321.359999999988
intercube_allreduce,ring_1d,6,128,256,4096,2342.3599999999824
intercube_allreduce,ring_1d,6,512,1024,16384,2479.3599999999824
intercube_allreduce,ring_1d,6,1024,2048,32768,2669.3599999999824
intercube_allreduce,ring_1d,6,2048,4096,65536,3049.3599999999824
intercube_allreduce,ring_1d,6,4096,8192,131072,3809.3599999999715
intercube_allreduce,ring_1d,6,8192,16384,262144,5329.359999999979
intercube_allreduce,ring_1d,6,16384,32768,524288,8369.35999999992
intercube_allreduce,ring_1d,6,32768,65536,1048576,14449.359999999899
intercube_allreduce,ring_1d,6,49152,98304,1572864,20529.35999999997
intercube_allreduce,torus_2d,6,8,16,256,1644.2899999999936
intercube_allreduce,torus_2d,6,32,64,1024,1651.0399999999909
intercube_allreduce,torus_2d,6,64,128,2048,1660.0399999999881
intercube_allreduce,torus_2d,6,128,256,4096,1678.0399999999827
intercube_allreduce,torus_2d,6,512,1024,16384,1795.0399999999827
intercube_allreduce,torus_2d,6,1024,2048,32768,1957.0399999999827
intercube_allreduce,torus_2d,6,2048,4096,65536,2281.0399999999827
intercube_allreduce,torus_2d,6,4096,8192,131072,2929.039999999979
intercube_allreduce,torus_2d,6,8192,16384,262144,4225.039999999986
intercube_allreduce,torus_2d,6,16384,32768,524288,6817.039999999943
intercube_allreduce,torus_2d,6,32768,65536,1048576,12001.03999999992
intercube_allreduce,torus_2d,6,49152,98304,1572864,17185.039999999994
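A minimal sketch that evaluates the hand-derived theoretical curve from the commit message against a few rows of the CSV above. The curve parameters (8 PEs, 128 B packets, 1346 ns startup, 1.20 ns/pkt) and the 20.5 µs limit come from the commit message; the constant names, the embedded rows, and the inline threshold check are illustrative assumptions, not the project's plotting or test code.

```python
import csv
import io

# Hand-derived curve parameters as stated in the commit message.
PES_PER_CUBE = 8
PACKET_BYTES = 128
STARTUP_NS = 1346.0
NS_PER_PACKET = 1.20

# Threshold from the Phase 1 test description (torus_2d, 96 KB / PE).
PHASE1_LIMIT_NS = 20_500.0

# Three torus_2d rows copied verbatim from the CSV above.
ROWS = """\
algorithm,sip_topology,n_sips,n_elem,bytes_per_pe,bytes_per_sip,latency_ns
intercube_allreduce,torus_2d,6,8,16,256,1644.2899999999936
intercube_allreduce,torus_2d,6,32768,65536,1048576,12001.03999999992
intercube_allreduce,torus_2d,6,49152,98304,1572864,17185.039999999994
"""


def theoretical_latency_ns(bytes_per_pe: int) -> float:
    """Startup cost plus a fixed per-packet cost for one cube's traffic."""
    packets = bytes_per_pe * PES_PER_CUBE / PACKET_BYTES
    return STARTUP_NS + NS_PER_PACKET * packets


def pct_change(before: float, after: float) -> float:
    """Relative change in percent (negative means faster)."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    for row in csv.DictReader(io.StringIO(ROWS)):
        bpp = int(row["bytes_per_pe"])
        measured = float(row["latency_ns"])
        theory = theoretical_latency_ns(bpp)
        print(f"{row['sip_topology']} {bpp:>6} B/PE: "
              f"measured {measured:9.1f} ns, curve {theory:8.1f} ns")
        if bpp == 98304:
            # Hypothetical re-statement of the Phase 1 latency assertion.
            assert measured < PHASE1_LIMIT_NS, "96 KB torus_2d exceeds 20.5 us"
    print(f"corner -> center: {pct_change(22.0, 17.2):.1f}%")  # -21.8%
```

At 96 KB / PE the curve works out to 98304 × 8 ÷ 128 = 6144 packets, or 1346 + 1.20 × 6144 = 8718.8 ns, compared with 17185 ns measured for torus_2d.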