Fix all remaining test failures: single-cube allreduce + matplotlib dep

- intercube_allreduce: add single-cube fast path that skips intra-SIP
  mesh reduce and goes directly to inter-SIP exchange. Fixes IPCQ
  deadlock when TP launches kernel on one cube per SIP.
- distributed.py: derive effective cube dims from tensor shard placement
  instead of hardcoding topology mesh size.
- pyproject.toml: add matplotlib>=3.7 to dependencies.
- pe_dma.py (prior commit): add MMU translation in pipeline DMA path.

577 passed, 0 failed (was 529 passed, 10 failed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-27 21:25:31 -07:00
parent d55dc6cb4f
commit fca24feac5
15 changed files with 112 additions and 95 deletions
+12 -1
View File
@@ -113,7 +113,18 @@ class AhbmCCLBackend:
)
n_elem = shards[0].nbytes // tensor.itemsize
kernel_fn = self._algo_module.kernel
kernel_args = self._algo_module.kernel_args(self._world_size, n_elem)
# Derive effective cube dims from tensor's actual shard placement
# (may differ from topology mesh when TP uses fewer cubes).
sip0_cubes = sorted({s.cube for s in shards if s.sip == shards[0].sip})
eff_n_cubes = len(sip0_cubes) if sip0_cubes else 1
if eff_n_cubes == 1:
eff_cube_w, eff_cube_h = 1, 1
else:
eff_cube_w, eff_cube_h = self._cube_w, self._cube_h
kernel_args = self._algo_module.kernel_args(
self._world_size, n_elem,
cube_w=eff_cube_w, cube_h=eff_cube_h,
)
# Resolve sip_rank from the current greenlet's bound rank
from greenlet import getcurrent as _gc