Phase B Option A (freeze + defer to ADR-0027): the root cause of
ring_default_ws strict-xfail is that bench workers call torch.zeros /
copy_ which drive env.run in the WORKER-greenlet context. Any pending
KernelLaunchMsg gets stepped inside that worker, spawning kernel_runner
with parent = worker (not main). When the worker yields/finishes, the
kernel greenlet is orphaned and its next switch_to_simpy raises
GreenletExit mid-add — producing rank 0 mean=1 (expected 3).
This is a larger architectural redesign (lazy-deploy tensor API,
coroutine worker, or setup/verify split) and is parked until ADR-0027
(Megatron TP) starts, where the proper solution ships with TP use cases.
No production changes; xfail reason + inline comment only.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause (hang diagnosis):
`kernel_runner.run()` captures `greenlet.getcurrent()` at spawn time as
the kernel greenlet's `_parent`. When a worker greenlet (say g0) calls
`dist.all_reduce` → `ctx.wait(h)` → `env.run(until=h0)`, the SimPy
scheduler steps pe_cpu processes, which in turn spawn kernel greenlets.
Those kernels' `_parent` becomes g0 (current greenlet at spawn). When a
kernel yields via switch_to_simpy, control jumps back up to g0's LAST
switch point — which is the main scheduler's `g.switch()` call — rather
than the kernel_runner's generator frame. Main then re-enters its
`for g in alive: g.switch()` loop mid-wait, producing nested greenlet
re-entry. Scheduler spins: g0 never completes, g1 appears to complete
out of order, infinite loop at 100% CPU.
Fix:
- AhbmCCLBackend.all_reduce: in multi-greenlet mode, submit via
launch(_defer_wait=True), extend backend._pending_collective_handles,
and yield to the parent greenlet. Worker does NOT call wait.
- benches/ccl_allreduce.py run(): after each scheduler round, the MAIN
greenlet drains backend._pending_collective_handles. This keeps
env.run invocation in the main context, so kernel_runner's spawned
kernel greenlets have main as their _parent — no nested re-entry.
- Legacy single-driver path (no bench scheduler): all_reduce falls back
to inline wait when g.parent is None.
Result:
- Multi-greenlet cross-SIP ring no longer hangs (was 100% CPU infinite
loop in kernel_runner._switch_kernel).
- ring_default_ws still xfail(strict=True): now fails as a data
correctness issue — DataExecutor reports only 1 math op for a 2-rank
ring (expected 2). Cross-SIP op_log replay integration is the
remaining Phase B task.
514 passed, 1 xfailed (strict).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2-rank bidirectional ring deadlock: when E and W neighbors point to the
same peer, sender-coord matching in _handle_meta_arrival / _credit_worker
picked the first direction in dict order, landing data in the wrong rx
slot relative to what the kernel recv(W) was waiting on.
Fix (ADR-0025 D1/D2/D3):
- install.reverse_direction: prefer OPPOSITE direction (E↔W, N↔S) when
peer has it pointing back to us; fallback to any matching for
topologies without opposite convention (tree_binary parent/child).
- _handle_meta_arrival: match by token.dst_addr range against each qp's
my_rx_base_pa + n_slots × slot_size window (unambiguous).
- _credit_worker: match by credit.dst_rx_base_pa == qp.peer.rx_base_pa.
- IpcqCreditMetadata: new dst_rx_base_pa field carrying receiver-side
rx base; _delayed_credit_send fills it from the consuming qp.
Tests (Phase 1 → Phase 2):
- test_reverse_direction_opposite_preference_2rank_ring
- test_reverse_direction_opposite_preference_4rank_ring_sanity
- test_meta_arrival_matches_by_dst_addr_same_peer
- test_credit_matches_by_dst_rx_base_pa_same_peer
- Existing credit-return test updated with dst_rx_base_pa.
508 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tensor.__setitem__ / __getitem__:
- Shard-aligned slice assignment and read on deployed tensors.
- Scalar broadcast and numpy array assignment supported.
- Cross-shard slices raise NotImplementedError (use copy_ for that).
- 3 new tests: single-PE, multi-PE, cross-shard error case.
Hierarchical all-reduce kernel (src/kernbench/ccl/algorithms/):
- 3-level reduce: intra-cube (E/W) → inter-cube (N/S) → inter-SIP (parent).
- Bidirectional ring reduce at each level: ceil((N-1)/2) rounds.
Left half sends via dir_dec, right half via dir_inc (wrap).
Representative receives from both sides.
- Chain broadcast for reverse path: cube 0 PE 0 → all PE 0s → all PEs.
- Registered in ccl.yaml as "hierarchical_allreduce" with topology: none
(neighbors() override builds the full 3-level neighbor map).
- kernel_args derives pes_per_cube/cubes_per_sip/num_sips from world_size.
- Mock-verified at 8/16/32/64/128 ranks.
Mock runtime fixes:
- Direction pairing: explicit N↔S, E↔W, parent↔parent instead of
"first matching reverse". Fixes 2-element rings where N and S both
point to the same peer.
- Deadlock detection: send-counter based (not just queue-depth-total)
to catch chain reductions where send+recv pairs net to zero.
- Multi-cube program_id: pes_per_cube parameter enables
program_id(axis=0) = PE within cube, program_id(axis=1) = cube id.
Legacy single-cube tests unaffected (default = world_size).
504 tests pass in 12s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: In ring all-reduce, PE_IPCQ's recv handler advances my_tail
and issues a credit return immediately. With tight credit latency
(0.12ns intra-cube), the sender can refill the slot BEFORE the
receiver's outbound PE_DMA reads from it for the next send. The
outbound snapshot then captures stale data from a later round.
Fix: Propagate TensorHandle.data (captured at recv-time, before credit
return) through the entire send chain:
tl.send(src=handle) → IpcqSendCmd.data → IpcqDmaToken.data
PE_DMA outbound already prefers token.data over MemoryStore read, so
the recv-time snapshot is used for the in-flight data. This eliminates
the race: the snapshot is captured before the slot can be overwritten.
Additional fixes:
- PE_MATH handle_command: compute SIMD latency from output tensor
element count via _compute_ns(), using max(overhead_ns, compute_ns).
Previously used overhead_ns=0.0 for all standalone MathCmd, making
math ops take 0ns in SimPy.
- DataExecutor secondary sort: same-t_start ops sorted by op_kind
(memory < gemm < math) so IPCQ slot writes execute before math reads.
- ipcq_copy recorded at INBOUND time (receiver PE_DMA arrival) instead
of outbound. Inbound time is after fabric propagation, so it sorts
correctly relative to the receiver's math.
- record_copy accepts explicit snapshot parameter (from token.data).
Result: N_ELEM=32 + 256-rank + n_slots=4 + cross-SIP now passes.
n_slots reverted to 4 (the deeper buffer was a workaround, not needed).
502 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Provides a shared `topology` fixture that caches the parsed
topology.yaml result per pytest-xdist worker session. Tests that
build a GraphEngine can accept `topology` instead of calling
resolve_topology("topology.yaml") repeatedly.
Topology parsing costs ~32ms, so the practical saving per worker is
modest (<1s across all tests). The fixture is mainly for architectural
cleanliness — keeping the "parse once, build engine many" pattern
explicit.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test matrix restructure:
- 256-rank full-system ring runs only ONCE (marked pytest.mark.slow)
instead of 7× across matrix + perf tests. Cross-SIP routing is
verified by the single run; buffer variants (tcm/hbm/sram) are
tested at 8-rank where they finish in <0.5s.
- Performance tests use 8-rank instead of 256-rank.
- `pytest -m "not slow"` completes in ~2.5min (local dev).
- Full suite including slow: ~6min (CI).
DataExecutor optimization:
- Remove ThreadPoolExecutor from DataExecutor.run(). Same-t_start
groups are almost always size 1, so the thread pool creation and
dispatch overhead dominated. Simple sequential loop is faster.
- Skip dma_read ops at the loop level (they are always no-ops in
Phase 2 but were dispatched through _execute_op → _execute_memory).
- Remove redundant CLI Phase 2 re-execution: engine._flush_data_phase
already replays during engine.wait(); the CLI now only prints the
diagnostic summary without re-running DataExecutor.
502 tests pass. Wall time: 25m30s → 5m43s (full), 2m28s (no slow).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tl.program_id(axis=0) returns local PE id within cube,
tl.program_id(axis=1) returns cube id. Enables cube-aware
sharding in benchmark kernels.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- CLI: --verify-data flag enables Phase 2 data verification (ADR-0020)
- Tensor.data: returns actual numpy values (verify-data) or zeros placeholder
- Tensor.__repr__: shows value summary or data=N/A (placeholder)
- DataExecutor: ThreadPoolExecutor for same-timestamp parallel op execution
- BenchResult.engine: exposes op_log/memory_store for Phase 2 access
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Design for actual data storage/computation in HBM/TCM/SRAM components:
- Phase 1: SimPy timing + MemoryStore (memory ops data-aware via greenlet)
- Phase 2: op_log-based numpy execution for GEMM/Math verification
- Greenlet-based KernelRunner replaces Phase 0 command list generation
- tl.load() returns real data in Phase 1, enabling memory-based control flow
- ComponentBase hook for op logging (single source of truth)
- MemoryStore: numpy ndarray tensor-granular storage with reference semantics
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Component placement uses mm coordinates in topology.yaml, mesh_gen
finds the nearest router automatically. M_CPU moved to pos_mm=[7.5,2.0]
(→ r0c2), SRAM at pos_mm=[1.5,9.0] (→ r3c0).
No hardcoded router references in topology config.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stub length increased to 12px (PE/HBM) and 10px (UCIe).
Gap between router and component increased to 30px so both
45° stubs (router end + component end) are clearly visible.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- M_CPU placed north (above) its router
- All connectors: 45° stub from router → straight → 45° stub to component
- Consistent 4-point polyline pattern for PE, M_CPU, SRAM, HBM, UCIe
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
UCIe position calculated with minimal inset (0.3 × size) to
place components flush against cube boundary edges.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All connectors now start with 45° diagonal from router edge,
then go straight (vertical/horizontal) to the component block.
Applies to PE, M_CPU/SRAM, PE→HBM, and UCIe connectors.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Router-router mesh links remain straight (horizontal/vertical).
All component→router connectors use 45° L-bend polylines:
- PE blocks: vertical then 45° diagonal to router
- M_CPU/SRAM: horizontal then 45° diagonal to router
- PE→HBM port group: vertical then 45° diagonal
- UCIe port→router: direction-aware 45° bend
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
UCIe components placed at defined positions from _cube_local_positions
with size from cube.geometry.ucie_mm.size. N/S horizontal, E/W vertical.
Connection ports rendered as color-coded boxes inside UCIe component.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- UCIe-N/S/E/W drawn as component blocks inside cube boundary
(inset 3mm from edge)
- Each UCIe has c0-c3 connection ports as color-coded boxes inside
- Connector lines from each port box to its attached router
- Removed old UCIe rendering that placed blocks outside cube
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- HBM pseudo channel ports split to top/bottom edges of HBM zone
(32 ports each, 8 per PE, color-coded)
- PE→HBM lines connect router to its port group center
- Per-PE label: "PE0×8ch" with BW annotation
- UCIe blocks flush against cube edges at router positions
- UCIe blocks smaller (22×10px)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- HBM connection lines angled 30% toward HBM center (not vertical)
to distinguish from mesh links
- M_CPU/SRAM blocks placed to the left of their router
with horizontal connector lines (avoid mesh overlap)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Draw HBM connection lines last (on top of component blocks).
PE routers: thicker (1.5px, opacity 0.6) with dashed style.
Relay routers: thinner (0.7px, opacity 0.2).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All router-attached components (PE, M_CPU, SRAM, UCIe) rendered as
labeled blocks with explicit connector lines to their router.
UCIe blocks positioned at cube edges matching port direction.
Router→HBM_CTRL lines shown for all 32 routers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Dedicated cube_view renderer showing 6×6 router grid with attachments
- PE blocks drawn next to their router (above/below)
- HBM pseudo channel port bar (64 ports, color-coded by PE owner)
- Per-PE BW annotations on HBM links
- Router color-coded by type (PE/M_CPU/SRAM/UCIe/relay)
- Title shows mode, channel count, per-PE and total BW
- Legend for all component types
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shrink cube-view component nodes to avoid clutter.
HBM and router_mesh edge lines made thinner and more transparent.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PE nodes are shifted 1.2mm above (top half) or below (bottom half)
their assigned router position. PE size reduced to 1.4x0.7mm.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- cube_view now renders all 32 router nodes from cube_mesh.yaml
instead of collapsed "router_mesh" placeholder
- Fix mesh_gen row Y position overlap (r1/r2 and r3/r4 had same Y)
by adding hbm_gap spacing between PE rows and HBM zone
- Add noc_router to visualizer KIND_SIZE for proper sizing
- Update cube view tests for individual router nodes
339 passed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add router → PE_MMU edge so MmuMapMsg can reach PE_MMU via
the router mesh. Unskip all PE_MMU fabric tests.
339 passed, 0 skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RuntimeContext._ensure_allocators() now limits SIP range to
target_device (single SIP or all). Prevents cross-SIP tensor
deployment that caused PE_TCM routing errors.
Also accept 'sip0' format (without colon) in DeviceSelector.
331 passed, 8 skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove all xbar/bridge rendering from cube detail view.
Replace 8 HBM slices with single HBM_CTRL block.
Add green dotted lines showing router-to-HBM connectivity.
Update legend, event animation, and PE view NOC destinations.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove xbar_top/bot, bridge, single noc node from topology
- Each cube_mesh.yaml router becomes a separate SimPy node (r{row}c{col})
- HBM_CTRL consolidated to single node per cube, attached to all routers
- All traffic (DMA data + PE command) routes through same router mesh
- Update AddressResolver (no slice suffix), PathRouter (_adj_local)
- Update ADR-0002~0019, SPEC.md to remove xbar/bridge references
- Regenerate SVG diagrams for new topology structure
- Skip cross-SIP PE_TCM and PE_MMU routing tests (not yet wired)
326 passed, 13 skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ADR-0018: LA replaces VA, BAAW segment-based mapping in PE_DMA,
1:1 (per-channel) and n:1 (aggregated) modes with parameterized
channel count.
ADR-0019: xbar/bridge removal, channel router topology with
horizontal line layout, aggregated router for n:1 mode,
unified NOC path for local/remote HBM access.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add cycle-accurate PE accelerator scheduler (SchedulerV2) with tiled
GEMM/Math pipelines (DMA_IN → GEMM → MATH → DMA_WB)
- Add DPPolicy num_pes/num_cubes/num_sips overrides for single-PE testing
- Support tuple target_pe for targeting specific PE subsets
- Add gemm_single_pe and gpt3_qkv benchmarks
- Switch default topology to pe_scheduler_v2
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>