Fix all remaining test failures: single-cube allreduce + matplotlib dep

- intercube_allreduce: add a single-cube fast path that skips the
  intra-SIP mesh reduce and goes directly to the inter-SIP exchange.
  Fixes the IPCQ deadlock that occurs when TP launches the kernel on
  only one cube per SIP.
- distributed.py: derive effective cube dims from tensor shard placement
  instead of hardcoding topology mesh size.
- pyproject.toml: add matplotlib>=3.7 to dependencies.
- pe_dma.py (prior commit): add MMU translation in pipeline DMA path.
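
A rough sketch of how the first two items fit together, offered as an
illustration only and not as the code in this commit. All names here
(`effective_cube_dims`, `allreduce`, the `(sip, cube_row, cube_col)`
placement tuples) are hypothetical, and the reduce and exchange stages are
simulated with plain Python sums rather than real mesh or IPCQ traffic.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical placement record: (sip, cube_row, cube_col) of each shard.
Coord = Tuple[int, int, int]


def effective_cube_dims(placement: Iterable[Coord]) -> Tuple[int, int]:
    """Count the distinct cube rows and columns that actually hold a shard.

    This stands in for "derive dims from shard placement" as opposed to
    reading a fixed mesh size from the topology description.
    """
    coords = list(placement)
    rows = {row for _, row, _ in coords}
    cols = {col for _, _, col in coords}
    return len(rows), len(cols)


def allreduce(placement: List[Coord], values: List[float]) -> List[float]:
    """Toy allreduce: every participant ends up with the global sum."""
    # Group the partial values by the SIP that owns them.
    per_sip: Dict[int, List[float]] = {}
    for (sip, _, _), value in zip(placement, values):
        per_sip.setdefault(sip, []).append(value)

    if effective_cube_dims(placement) == (1, 1):
        # Fast path (assumed): one cube per SIP, so there is no peer cube
        # to reduce with. Skip the intra-SIP stage entirely.
        sip_partials = [vals[0] for vals in per_sip.values()]
    else:
        # General path: intra-SIP reduce first (simulated by a sum).
        sip_partials = [sum(vals) for vals in per_sip.values()]

    # Inter-SIP exchange (simulated): combine the per-SIP partials.
    total = sum(sip_partials)
    return [total] * len(values)
```

In this toy model the general path also gives the right answer for a single
cube, so the fast path only changes which stages run; the deadlock described
above is a property of the real intra-SIP stage waiting on peers that were
never launched, which a simulation like this does not reproduce.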

577 passed, 0 failed (was 529 passed, 10 failed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Yangwook Kang

2026-04-27 21:25:31 -07:00

parent d55dc6cb4f

commit fca24feac5

15 changed files with 112 additions and 95 deletions

tests/allreduce_latency_plots/topology.png

Binary file not shown (194 KiB before, 194 KiB after).