kernbench2/tests/allreduce_latency_plots/topology.png at fca24feac5ab77515621bab1521f14f49182e84a

ywkang fca24feac5 Fix all remaining test failures: single-cube allreduce + matplotlib dep

- intercube_allreduce: add single-cube fast path that skips intra-SIP
  mesh reduce and goes directly to inter-SIP exchange. Fixes IPCQ
  deadlock when TP launches kernel on one cube per SIP.
- distributed.py: derive effective cube dims from tensor shard placement
  instead of hardcoding topology mesh size.
- pyproject.toml: add matplotlib>=3.7 to dependencies.
- pe_dma.py (prior commit): add MMU translation in pipeline DMA path.

577 passed, 0 failed (was 529 passed, 10 failed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-27 21:25:31 -07:00

194 KiB

1618x1157px

/ywkang/kernbench2/raw/commit/fca24feac5ab77515621bab1521f14f49182e84a/tests/allreduce_latency_plots/topology.png
