Provides a shared `topology` fixture that caches the parsed
topology.yaml result per pytest-xdist worker session. Tests that
build a GraphEngine can accept `topology` instead of calling
resolve_topology("topology.yaml") repeatedly.
Topology parsing costs ~32ms, so the practical saving per worker is
modest (<1s across all tests). The fixture is mainly for architectural
cleanliness — keeping the "parse once, build engine many" pattern
explicit.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test matrix restructure:
- 256-rank full-system ring runs only ONCE (marked pytest.mark.slow)
instead of 7× across matrix + perf tests. Cross-SIP routing is
verified by the single run; buffer variants (tcm/hbm/sram) are
tested at 8-rank where they finish in <0.5s.
- Performance tests use 8-rank instead of 256-rank.
- `pytest -m "not slow"` completes in ~2.5min (local dev).
- Full suite including slow: ~6min (CI).
DataExecutor optimization:
- Remove ThreadPoolExecutor from DataExecutor.run(). Same-t_start
groups are almost always size 1, so the thread pool creation and
dispatch overhead dominated. Simple sequential loop is faster.
- Skip dma_read ops at the loop level (they are always no-ops in
Phase 2 but were dispatched through _execute_op → _execute_memory).
- Remove redundant CLI Phase 2 re-execution: engine._flush_data_phase
already replays during engine.wait(); the CLI now only prints the
diagnostic summary without re-running DataExecutor.
502 tests pass. Wall time: 25m30s → 5m43s (full), 2m28s (no slow).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tl.program_id(axis=0) returns local PE id within cube,
tl.program_id(axis=1) returns cube id. Enables cube-aware
sharding in benchmark kernels.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- CLI: --verify-data flag enables Phase 2 data verification (ADR-0020)
- Tensor.data: returns actual numpy values (verify-data) or zeros placeholder
- Tensor.__repr__: shows value summary or data=N/A (placeholder)
- DataExecutor: ThreadPoolExecutor for same-timestamp parallel op execution
- BenchResult.engine: exposes op_log/memory_store for Phase 2 access
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Design for actual data storage/computation in HBM/TCM/SRAM components:
- Phase 1: SimPy timing + MemoryStore (memory ops data-aware via greenlet)
- Phase 2: op_log-based numpy execution for GEMM/Math verification
- Greenlet-based KernelRunner replaces Phase 0 command list generation
- tl.load() returns real data in Phase 1, enabling memory-based control flow
- ComponentBase hook for op logging (single source of truth)
- MemoryStore: numpy ndarray tensor-granular storage with reference semantics
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Component placement uses mm coordinates in topology.yaml, mesh_gen
finds the nearest router automatically. M_CPU moved to pos_mm=[7.5,2.0]
(→ r0c2), SRAM at pos_mm=[1.5,9.0] (→ r3c0).
No hardcoded router references in topology config.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stub length increased to 12px (PE/HBM) and 10px (UCIe).
Gap between router and component increased to 30px so both
45° stubs (router end + component end) are clearly visible.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- M_CPU placed north (above) its router
- All connectors: 45° stub from router → straight → 45° stub to component
- Consistent 4-point polyline pattern for PE, M_CPU, SRAM, HBM, UCIe
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
UCIe position calculated with minimal inset (0.3 × size) to
place components flush against cube boundary edges.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All connectors now start with 45° diagonal from router edge,
then go straight (vertical/horizontal) to the component block.
Applies to PE, M_CPU/SRAM, PE→HBM, and UCIe connectors.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Router-router mesh links remain straight (horizontal/vertical).
All component→router connectors use 45° L-bend polylines:
- PE blocks: vertical then 45° diagonal to router
- M_CPU/SRAM: horizontal then 45° diagonal to router
- PE→HBM port group: vertical then 45° diagonal
- UCIe port→router: direction-aware 45° bend
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
UCIe components placed at defined positions from _cube_local_positions
with size from cube.geometry.ucie_mm.size. N/S horizontal, E/W vertical.
Connection ports rendered as color-coded boxes inside UCIe component.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- UCIe-N/S/E/W drawn as component blocks inside cube boundary
(inset 3mm from edge)
- Each UCIe has c0-c3 connection ports as color-coded boxes inside
- Connector lines from each port box to its attached router
- Removed old UCIe rendering that placed blocks outside cube
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- HBM pseudo channel ports split to top/bottom edges of HBM zone
(32 ports each, 8 per PE, color-coded)
- PE→HBM lines connect router to its port group center
- Per-PE label: "PE0×8ch" with BW annotation
- UCIe blocks flush against cube edges at router positions
- UCIe blocks smaller (22×10px)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- HBM connection lines angled 30% toward HBM center (not vertical)
to distinguish from mesh links
- M_CPU/SRAM blocks placed to the left of their router
with horizontal connector lines (avoid mesh overlap)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Draw HBM connection lines last (on top of component blocks).
PE routers: thicker (1.5px, opacity 0.6) with dashed style.
Relay routers: thinner (0.7px, opacity 0.2).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All router-attached components (PE, M_CPU, SRAM, UCIe) rendered as
labeled blocks with explicit connector lines to their router.
UCIe blocks positioned at cube edges matching port direction.
Router→HBM_CTRL lines shown for all 32 routers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Dedicated cube_view renderer showing 6×6 router grid with attachments
- PE blocks drawn next to their router (above/below)
- HBM pseudo channel port bar (64 ports, color-coded by PE owner)
- Per-PE BW annotations on HBM links
- Router color-coded by type (PE/M_CPU/SRAM/UCIe/relay)
- Title shows mode, channel count, per-PE and total BW
- Legend for all component types
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shrink cube-view component nodes to avoid clutter.
HBM and router_mesh edge lines made thinner and more transparent.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PE nodes are shifted 1.2mm above (top half) or below (bottom half)
their assigned router position. PE size reduced to 1.4x0.7mm.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- cube_view now renders all 32 router nodes from cube_mesh.yaml
instead of collapsed "router_mesh" placeholder
- Fix mesh_gen row Y position overlap (r1/r2 and r3/r4 had same Y)
by adding hbm_gap spacing between PE rows and HBM zone
- Add noc_router to visualizer KIND_SIZE for proper sizing
- Update cube view tests for individual router nodes
339 passed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add router → PE_MMU edge so MmuMapMsg can reach PE_MMU via
the router mesh. Unskip all PE_MMU fabric tests.
339 passed, 0 skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RuntimeContext._ensure_allocators() now limits SIP range to
target_device (single SIP or all). Prevents cross-SIP tensor
deployment that caused PE_TCM routing errors.
Also accept 'sip0' format (without colon) in DeviceSelector.
331 passed, 8 skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove all xbar/bridge rendering from cube detail view.
Replace 8 HBM slices with single HBM_CTRL block.
Add green dotted lines showing router-to-HBM connectivity.
Update legend, event animation, and PE view NOC destinations.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove xbar_top/bot, bridge, single noc node from topology
- Each cube_mesh.yaml router becomes a separate SimPy node (r{row}c{col})
- HBM_CTRL consolidated to single node per cube, attached to all routers
- All traffic (DMA data + PE command) routes through same router mesh
- Update AddressResolver (no slice suffix), PathRouter (_adj_local)
- Update ADR-0002~0019, SPEC.md to remove xbar/bridge references
- Regenerate SVG diagrams for new topology structure
- Skip cross-SIP PE_TCM and PE_MMU routing tests (not yet wired)
326 passed, 13 skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ADR-0018: LA replaces VA, BAAW segment-based mapping in PE_DMA,
1:1 (per-channel) and n:1 (aggregated) modes with parameterized
channel count.
ADR-0019: xbar/bridge removal, channel router topology with
horizontal line layout, aggregated router for n:1 mode,
unified NOC path for local/remote HBM access.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add cycle-accurate PE accelerator scheduler (SchedulerV2) with tiled
GEMM/Math pipelines (DMA_IN → GEMM → MATH → DMA_WB)
- Add DPPolicy num_pes/num_cubes/num_sips overrides for single-PE testing
- Support tuple target_pe for targeting specific PE subsets
- Add gemm_single_pe and gpt3_qkv benchmarks
- Switch default topology to pe_scheduler_v2
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Model fabric response hop latency for PE-internal operations:
- HBM_CTRL sends PeDmaMsg response on reverse path instead of direct done signal
- PE_CPU sends ResponseMsg via NOC→M_CPU on kernel completion
- Add NOC→PE_DMA and PE_CPU→NOC edges in topology builder
- Make HBM BW test assertions dynamic based on topology efficiency
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>