fca24feac5
- intercube_allreduce: add single-cube fast path that skips intra-SIP mesh reduce and goes directly to inter-SIP exchange. Fixes IPCQ deadlock when TP launches kernel on one cube per SIP.
- distributed.py: derive effective cube dims from tensor shard placement instead of hardcoding topology mesh size.
- pyproject.toml: add matplotlib>=3.7 to dependencies.
- pe_dma.py (prior commit): add MMU translation in pipeline DMA path.

577 passed, 0 failed (was 529 passed, 10 failed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
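
The first two items of the commit message can be pictured with a minimal sketch. It is not the code from this commit: the function names, the `(row, col)` cube coordinates, and the stage labels are hypothetical placeholders, and the real kernel may have more steps. It assumes the effective cube dims are the bounding box of the cubes that hold a shard, and that the fast path is chosen when that box contains a single cube.

```python
from typing import Iterable, List, Tuple

# Hypothetical (row, col) coordinate of a cube inside one SIP.
Coord = Tuple[int, int]


def effective_cube_dims(shard_cubes: Iterable[Coord]) -> Tuple[int, int]:
    """Size of the bounding box of the cubes that actually hold a shard.

    Assumption: this stands in for "derive effective cube dims from
    tensor shard placement" rather than reading a fixed mesh size.
    """
    coords = list(shard_cubes)
    if not coords:
        raise ValueError("tensor has no shards placed on this SIP")
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (max(rows) - min(rows) + 1, max(cols) - min(cols) + 1)


def allreduce_stages(shard_cubes: Iterable[Coord]) -> List[str]:
    """Return placeholder stage names for one all-reduce call."""
    n_rows, n_cols = effective_cube_dims(shard_cubes)
    if n_rows * n_cols == 1:
        # Single-cube fast path: there is nothing to reduce inside the
        # SIP, so go straight to the inter-SIP exchange.
        return ["inter_sip_exchange"]
    return ["intra_sip_mesh_reduce", "inter_sip_exchange"]
```

With one cube per SIP, `allreduce_stages([(0, 0)])` yields only the inter-SIP stage, which is the case the commit message ties to the IPCQ deadlock.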
2.5 KiB
| line | algorithm | sip_topology | n_sips | n_elem | bytes_per_pe | bytes_per_sip | latency_ns |
|---|---|---|---|---|---|---|---|
| 2 | intercube_allreduce | mesh_2d_no_wrap | 6 | 8 | 16 | 256 | 3508.4249999999993 |
| 3 | intercube_allreduce | mesh_2d_no_wrap | 6 | 32 | 64 | 1024 | 3515.55 |
| 4 | intercube_allreduce | mesh_2d_no_wrap | 6 | 64 | 128 | 2048 | 3525.0499999999975 |
| 5 | intercube_allreduce | mesh_2d_no_wrap | 6 | 128 | 256 | 4096 | 3544.049999999992 |
| 6 | intercube_allreduce | mesh_2d_no_wrap | 6 | 512 | 1024 | 16384 | 3667.049999999992 |
| 7 | intercube_allreduce | mesh_2d_no_wrap | 6 | 1024 | 2048 | 32768 | 3837.049999999992 |
| 8 | intercube_allreduce | mesh_2d_no_wrap | 6 | 2048 | 4096 | 65536 | 4177.049999999992 |
| 9 | intercube_allreduce | mesh_2d_no_wrap | 6 | 4096 | 8192 | 131072 | 4857.049999999959 |
| 10 | intercube_allreduce | mesh_2d_no_wrap | 6 | 8192 | 16384 | 262144 | 6217.049999999945 |
| 11 | intercube_allreduce | mesh_2d_no_wrap | 6 | 16384 | 32768 | 524288 | 8937.049999999937 |
| 12 | intercube_allreduce | mesh_2d_no_wrap | 6 | 32768 | 65536 | 1048576 | 14377.049999999872 |
| 13 | intercube_allreduce | mesh_2d_no_wrap | 6 | 49152 | 98304 | 1572864 | 19817.049999999872 |
| 14 | intercube_allreduce | ring_1d | 6 | 8 | 16 | 256 | 3073.1299999999937 |
| 15 | intercube_allreduce | ring_1d | 6 | 32 | 64 | 1024 | 3079.8799999999947 |
| 16 | intercube_allreduce | ring_1d | 6 | 64 | 128 | 2048 | 3088.879999999992 |
| 17 | intercube_allreduce | ring_1d | 6 | 128 | 256 | 4096 | 3106.8799999999865 |
| 18 | intercube_allreduce | ring_1d | 6 | 512 | 1024 | 16384 | 3225.8799999999865 |
| 19 | intercube_allreduce | ring_1d | 6 | 1024 | 2048 | 32768 | 3391.8799999999865 |
| 20 | intercube_allreduce | ring_1d | 6 | 2048 | 4096 | 65536 | 3723.8799999999865 |
| 21 | intercube_allreduce | ring_1d | 6 | 4096 | 8192 | 131072 | 4387.879999999965 |
| 22 | intercube_allreduce | ring_1d | 6 | 8192 | 16384 | 262144 | 5715.879999999957 |
| 23 | intercube_allreduce | ring_1d | 6 | 16384 | 32768 | 524288 | 8371.879999999932 |
| 24 | intercube_allreduce | ring_1d | 6 | 32768 | 65536 | 1048576 | 13683.879999999903 |
| 25 | intercube_allreduce | ring_1d | 6 | 49152 | 98304 | 1572864 | 18995.879999999917 |
| 26 | intercube_allreduce | torus_2d | 6 | 8 | 16 | 256 | 2190.4799999999923 |
| 27 | intercube_allreduce | torus_2d | 6 | 32 | 64 | 1024 | 2196.479999999993 |
| 28 | intercube_allreduce | torus_2d | 6 | 64 | 128 | 2048 | 2204.4799999999905 |
| 29 | intercube_allreduce | torus_2d | 6 | 128 | 256 | 4096 | 2220.479999999985 |
| 30 | intercube_allreduce | torus_2d | 6 | 512 | 1024 | 16384 | 2325.479999999985 |
| 31 | intercube_allreduce | torus_2d | 6 | 1024 | 2048 | 32768 | 2471.479999999985 |
| 32 | intercube_allreduce | torus_2d | 6 | 2048 | 4096 | 65536 | 2763.479999999985 |
| 33 | intercube_allreduce | torus_2d | 6 | 4096 | 8192 | 131072 | 3347.4799999999777 |
| 34 | intercube_allreduce | torus_2d | 6 | 8192 | 16384 | 262144 | 4515.4799999999705 |
| 35 | intercube_allreduce | torus_2d | 6 | 16384 | 32768 | 524288 | 6851.479999999952 |
| 36 | intercube_allreduce | torus_2d | 6 | 32768 | 65536 | 1048576 | 11523.479999999923 |
| 37 | intercube_allreduce | torus_2d | 6 | 49152 | 98304 | 1572864 | 16195.479999999952 |
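
One way to read the table is to compare how fast latency grows with message size on each topology. The sketch below is an illustration rather than part of the benchmark tooling: `ROWS` holds the two largest sizes per topology copied from the table (with floating-point noise rounded off), and `marginal_ns_per_byte` is a hypothetical helper that takes the straight-line slope between those two points.

```python
from collections import defaultdict

# (sip_topology, bytes_per_sip, latency_ns) for the two largest sizes
# in the table above; latencies are rounded to remove float noise.
ROWS = [
    ("mesh_2d_no_wrap", 1048576, 14377.05),
    ("mesh_2d_no_wrap", 1572864, 19817.05),
    ("ring_1d", 1048576, 13683.88),
    ("ring_1d", 1572864, 18995.88),
    ("torus_2d", 1048576, 11523.48),
    ("torus_2d", 1572864, 16195.48),
]


def marginal_ns_per_byte(rows):
    """Slope of latency vs. bytes_per_sip between the two largest sizes."""
    by_topology = defaultdict(list)
    for topology, nbytes, latency in rows:
        by_topology[topology].append((nbytes, latency))

    slopes = {}
    for topology, points in by_topology.items():
        points.sort()  # order by message size
        (b0, t0), (b1, t1) = points[-2], points[-1]
        slopes[topology] = (t1 - t0) / (b1 - b0)
    return slopes


if __name__ == "__main__":
    for topology, slope in sorted(marginal_ns_per_byte(ROWS).items()):
        print(f"{topology}: {slope:.4f} ns per byte")
```

On these rows the slopes come out at roughly 0.0104 (mesh_2d_no_wrap), 0.0101 (ring_1d), and 0.0089 (torus_2d) ns per byte, so torus_2d has both the lowest small-message latency in the table and the lowest incremental cost at large sizes.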