Fix all remaining test failures: single-cube allreduce + matplotlib dep

- intercube_allreduce: add single-cube fast path that skips intra-SIP mesh
  reduce and goes directly to inter-SIP exchange. Fixes IPCQ deadlock when
  TP launches kernel on one cube per SIP.
- distributed.py: derive effective cube dims from tensor shard placement
  instead of hardcoding topology mesh size.
- pyproject.toml: add matplotlib>=3.7 to dependencies.
- pe_dma.py (prior commit): add MMU translation in pipeline DMA path.

577 passed, 0 failed (was 529 passed, 10 failed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -113,7 +113,18 @@ class AhbmCCLBackend:
         )
         n_elem = shards[0].nbytes // tensor.itemsize
         kernel_fn = self._algo_module.kernel
-        kernel_args = self._algo_module.kernel_args(self._world_size, n_elem)
+        # Derive effective cube dims from tensor's actual shard placement
+        # (may differ from topology mesh when TP uses fewer cubes).
+        sip0_cubes = sorted({s.cube for s in shards if s.sip == shards[0].sip})
+        eff_n_cubes = len(sip0_cubes) if sip0_cubes else 1
+        if eff_n_cubes == 1:
+            eff_cube_w, eff_cube_h = 1, 1
+        else:
+            eff_cube_w, eff_cube_h = self._cube_w, self._cube_h
+        kernel_args = self._algo_module.kernel_args(
+            self._world_size, n_elem,
+            cube_w=eff_cube_w, cube_h=eff_cube_h,
+        )
 
         # Resolve sip_rank from the current greenlet's bound rank
         from greenlet import getcurrent as _gc
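Note: the effective-dims derivation in this hunk can be exercised on its own. The following is a minimal sketch, not the backend's actual code. It assumes a hypothetical `Shard` record with only `sip` and `cube` fields (the real shard type is not shown in this diff) and a hypothetical free function `effective_cube_dims` standing in for the inline logic in `AhbmCCLBackend`.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Shard:
    """Hypothetical stand-in for a tensor shard: only the fields the hunk reads."""
    sip: int
    cube: int


def effective_cube_dims(
    shards: Sequence[Shard], cube_w: int, cube_h: int
) -> Tuple[int, int]:
    """Return (cube_w, cube_h) as seen by the kernel for this tensor.

    Mirrors the hunk: count the distinct cubes used on the first shard's SIP.
    If only one cube is used, report a 1x1 mesh; otherwise fall back to the
    topology mesh dims passed in.
    """
    first_sip = shards[0].sip
    sip0_cubes = sorted({s.cube for s in shards if s.sip == first_sip})
    eff_n_cubes = len(sip0_cubes) if sip0_cubes else 1
    if eff_n_cubes == 1:
        return 1, 1
    return cube_w, cube_h


# One cube per SIP (the TP case named in the commit message): 1x1 mesh.
single = [Shard(sip=0, cube=0), Shard(sip=1, cube=0)]
# Two cubes on each SIP: the topology mesh dims are kept.
multi = [Shard(0, 0), Shard(0, 1), Shard(1, 0), Shard(1, 1)]

print(effective_cube_dims(single, cube_w=2, cube_h=2))  # (1, 1)
print(effective_cube_dims(multi, cube_w=2, cube_h=2))   # (2, 2)
```

As in the hunk, only the first shard's SIP is inspected, so the sketch assumes every SIP uses the same number of cubes.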