Commit Graph

5 Commits

Author SHA1 Message Date
ywkang fca24feac5 Fix all remaining test failures: single-cube allreduce + matplotlib dep
- intercube_allreduce: add single-cube fast path that skips intra-SIP
  mesh reduce and goes directly to inter-SIP exchange. Fixes IPCQ
  deadlock when TP launches kernel on one cube per SIP.
- distributed.py: derive effective cube dims from tensor shard placement
  instead of hardcoding topology mesh size.
- pyproject.toml: add matplotlib>=3.7 to dependencies.
- pe_dma.py (prior commit): add MMU translation in pipeline DMA path.

577 passed, 0 failed (was 529 passed, 10 failed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-27 21:25:31 -07:00
ywkang 372c987995 Reduce test time to 12s: shrink GEMM dims + enable pytest-xdist
GEMM dimension reduction:
- qkv_gemm.py: M,K,N = 128,256,128 → 32,64,32 (64 tiles → 1 tile).
- qkv_gemm_multi_pe.py: same reduction.
- Tests verify pipeline correctness, not large-matrix throughput.
- Per-test time: 18s → 1.7s. 6 tests total: 108s → 10s.

pytest-xdist parallel execution:
- Add pytest-xdist to dev dependencies.
- pyproject.toml addopts: -n auto (use all CPU cores), -m "not slow".
- Default `pytest` runs 501 tests in ~12s (previously 148s).
- Full suite including slow: `pytest -m ""` → 3m24s (previously 5m43s).

pytest.mark.slow:
- Registered in pyproject.toml markers section.
- 256-rank full-system test is the only slow-marked test.
- Run with: pytest -m "" (CI) or pytest (local dev, skips slow).

502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 21:06:41 -07:00
ywkang 51004c311c Implement ADR-0020: 2-pass data execution with greenlet kernel runner
Step 1 — Foundation:
- OpRecord/OpLogger: op log infrastructure with t_start stable ordering
- MemoryStore: numpy ndarray tensor-granular storage (reference semantics)
- data_op=True flag on DmaReadCmd, DmaWriteCmd, GemmCmd, MathCmd, CompositeCmd
- numpy/greenlet dependencies added to pyproject.toml

Step 2 — ComponentBase hooks:
- _on_process_start/end hooks in _forward_txn (fabric messages)
- _handle_with_hooks in PeEngineBase (PE-internal commands)
- op_logger optional — zero overhead when disabled

Step 3 — KernelRunner + greenlet:
- KernelRunner: greenlet ↔ SimPy bridge in triton_emu/kernel_runner.py
- TLContext: _emit() method routes to greenlet switch or command list
- tl.load() returns real numpy data in greenlet mode
- Dynamic control flow supported (memory-read based branching)

Step 4 — PE_CPU integration:
- Greenlet mode when ctx.memory_store is set, legacy fallback otherwise
- Refactored into _execute_greenlet/_execute_legacy/_send_response
- ComponentContext gains memory_store and op_logger fields

Step 5 — DataExecutor:
- Phase 2 numpy execution for GEMM/Math ops from op_log
- _compute_math: all unary/binary/reduction ops
- verify(): compare MemoryStore against expected with dtype tolerance

28 new tests, 366 total passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 00:22:44 -07:00
ywkang dcbc41571f Add web topology viewer with hot path visualization
- FastAPI backend (server.py) with REST API + WebSocket for event streaming
- SVG-based topology viewer (index.html) with SIP/CUBE/PE drill-down views
- Event logging infrastructure (event_log.py) generating events from real
  probe cases (H2D/D2H/PE-DMA) and bench workloads (QKV GEMM single/multi-PE)
- Timeline replay engine with play/pause, speed control, and scrubbing
- Workload selector dropdown grouped by category (Probe/Bench)
- CLI entry points: kernbench web, ./kernbench wrapper scripts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 03:19:19 -07:00
ywkang 6f43807900 commit - release 1 2026-03-18 11:47:48 -07:00