Test matrix restructure:
- 256-rank full-system ring runs only ONCE (marked pytest.mark.slow)
instead of 7× across matrix + perf tests. Cross-SIP routing is
verified by the single run; buffer variants (tcm/hbm/sram) are
tested at 8-rank where they finish in <0.5s.
- Performance tests use 8-rank instead of 256-rank.
- `pytest -m "not slow"` completes in ~2.5min (local dev).
- Full suite including slow: ~6min (CI).
DataExecutor optimization:
- Remove ThreadPoolExecutor from DataExecutor.run(). Same-t_start
groups are almost always size 1, so the thread pool creation and
dispatch overhead dominated. Simple sequential loop is faster.
- Skip dma_read ops at the loop level (they are always no-ops in
Phase 2 but were dispatched through _execute_op → _execute_memory).
- Remove redundant CLI Phase 2 re-execution: engine._flush_data_phase
already replays during engine.wait(); the CLI now only prints the
diagnostic summary without re-running DataExecutor.
502 tests pass. Wall time: 25m30s → 5m43s (full), 2m28s (no slow).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tl.program_id(axis=0) returns local PE id within cube,
tl.program_id(axis=1) returns cube id. Enables cube-aware
sharding in benchmark kernels.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- CLI: --verify-data flag enables Phase 2 data verification (ADR-0020)
- Tensor.data: returns actual numpy values (verify-data) or zeros placeholder
- Tensor.__repr__: shows value summary or data=N/A (placeholder)
- DataExecutor: ThreadPoolExecutor for same-timestamp parallel op execution
- BenchResult.engine: exposes op_log/memory_store for Phase 2 access
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Component placement uses mm coordinates in topology.yaml, mesh_gen
finds the nearest router automatically. M_CPU moved to pos_mm=[7.5,2.0]
(→ r0c2), SRAM at pos_mm=[1.5,9.0] (→ r3c0).
No hardcoded router references in topology config.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Dedicated cube_view renderer showing 6×6 router grid with attachments
- PE blocks drawn next to their router (above/below)
- HBM pseudo channel port bar (64 ports, color-coded by PE owner)
- Per-PE BW annotations on HBM links
- Router color-coded by type (PE/M_CPU/SRAM/UCIe/relay)
- Title shows mode, channel count, per-PE and total BW
- Legend for all component types
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- cube_view now renders all 32 router nodes from cube_mesh.yaml
instead of collapsed "router_mesh" placeholder
- Fix mesh_gen row Y position overlap (r1/r2 and r3/r4 had same Y)
by adding hbm_gap spacing between PE rows and HBM zone
- Add noc_router to visualizer KIND_SIZE for proper sizing
- Update cube view tests for individual router nodes
339 passed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add router → PE_MMU edge so MmuMapMsg can reach PE_MMU via
the router mesh. Unskip all PE_MMU fabric tests.
339 passed, 0 skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RuntimeContext._ensure_allocators() now limits SIP range to
target_device (single SIP or all). Prevents cross-SIP tensor
deployment that caused PE_TCM routing errors.
Also accept 'sip0' format (without colon) in DeviceSelector.
331 passed, 8 skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove xbar_top/bot, bridge, single noc node from topology
- Each cube_mesh.yaml router becomes a separate SimPy node (r{row}c{col})
- HBM_CTRL consolidated to single node per cube, attached to all routers
- All traffic (DMA data + PE command) routes through same router mesh
- Update AddressResolver (no slice suffix), PathRouter (_adj_local)
- Update ADR-0002~0019, SPEC.md to remove xbar/bridge references
- Regenerate SVG diagrams for new topology structure
- Skip cross-SIP PE_TCM and PE_MMU routing tests (not yet wired)
326 passed, 13 skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Model fabric response hop latency for PE-internal operations:
- HBM_CTRL sends PeDmaMsg response on reverse path instead of direct done signal
- PE_CPU sends ResponseMsg via NOC→M_CPU on kernel completion
- Add NOC→PE_DMA and PE_CPU→NOC edges in topology builder
- Make HBM BW test assertions dynamic based on topology efficiency
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>