4ba0a83e71
Scope (Phase A):
- D1: world_size fallback = SIP count (rank = SIP, TP boundary)
- D9: greenlet-local get_rank + _bind_rank (single-driver fallback = 0)
- D10: torch.ahbm.set_device + torch.accelerator.set_device_index alias
- D11: tensor placement scoped to current-device SIP (post-hoc pe_index shift; ADR-0026 replaces with structural coords)
- D12/D13: multi-greenlet run() with simple round-robin scheduler; hybrid dispatch (ws == SIP count → multi-greenlet, else legacy single-worker for ccl.yaml override compat)
- D7 partial: backend.all_reduce submit + yield + wait via launch()'s new _defer_wait flag; parent-less greenlets skip yield
- Relaxed shard-count check (len(shards) > 0 instead of == world_size)
- rank_to_pe = SIP-representative [(r, 0, 0)] when ws <= n_sips

Deferred to Phase B:
- Engine-routed install (D2): keeps sideband
- install_plan.py module (D6): keeps install.py
- Epoch barrier (D7 full): simple yield is sufficient for ring ws=2 mock
- Validator registry (D8)
- Cross-SIP multi-greenlet + real kernel integration: matrix ring_default_ws hangs in SimPy drain despite ADR-0025 direction fix; marked xfail(run=False) pending Phase B diagnosis (suspected per-rank kernel_args / program_id mismatch)

Tests:
- test_ccl_ddp_launcher.py (6 new tests): D1/D9/D10/D11/D12/D13
- test_ccl_allreduce_matrix.py: ring_default_ws xfail'd, override cases (ring_tcm_8 / hbm_8 / sram_8 / multi_cube / mesh_2x2 / tree_binary_7) all pass via legacy path

514 tests pass, 1 xfail.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
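The D12/D13 items (hybrid dispatch plus a simple round-robin scheduler) can be sketched roughly as below. This is an illustration under stated assumptions, not the code from this commit: plain Python generators stand in for greenlets so the sketch needs only the standard library, and `choose_dispatch`, `run_round_robin`, and `make_rank` are hypothetical names that do not come from the repository.

```python
from collections import deque
from typing import Generator, List, Tuple


def choose_dispatch(world_size: int, n_sips: int) -> str:
    """Hypothetical helper mirroring the rule in the commit message:
    ws == SIP count -> multi-greenlet, otherwise the legacy single-worker path."""
    return "multi-greenlet" if world_size == n_sips else "legacy-single-worker"


def run_round_robin(workers: List[Generator[None, None, None]]) -> int:
    """Resume each worker in turn until all have finished.

    Returns the number of times a worker yielded control back to the scheduler.
    Generators are an assumption here; the commit uses greenlets.
    """
    queue = deque(workers)
    yields = 0
    while queue:
        worker = queue.popleft()
        try:
            next(worker)          # run the worker until its next yield
        except StopIteration:
            continue              # worker finished; do not re-queue it
        yields += 1
        queue.append(worker)      # still alive: goes to the back of the queue
    return yields


def make_rank(rank: int, log: List[Tuple[int, int]], n_steps: int = 2):
    """Hypothetical per-rank body: record a step, then yield to the scheduler
    (loosely analogous to 'submit + yield + wait')."""
    for step in range(n_steps):
        log.append((rank, step))
        yield


if __name__ == "__main__":
    trace: List[Tuple[int, int]] = []
    run_round_robin([make_rank(0, trace), make_rank(1, trace)])
    print(choose_dispatch(2, 2), trace)
```

With two ranks, the scheduler interleaves them step by step, which is the property a ring all-reduce with ws=2 relies on when each rank yields after submitting its work.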
33 lines
286 B
Plaintext
# OS / Editor
.DS_Store
.vscode/.history/
*.swp

# Auto-generated mesh file
cube_mesh.yaml

# Python
__pycache__/
*.py[cod]
*.pyd
.pytest_cache/
.mypy_cache/
.ruff_cache/

# Virtualenv
.venv/

# Packaging
dist/
build/
*.egg-info/

# Env
.env
.env.*
!.env.example

# Logs
*.log
.claude/