- KernelLaunchMsg gains target_start_ns: IO_CPU stamps a global barrier
(max path latency across every target PE), M_CPU passes it through,
PE_CPU yields until it before recording pe_exec_start. Every PE in a
launch begins kernel execution at the same env.now regardless of its
dispatch path length — eliminates per-PE dispatch-offset artifact in
cross-PE and cross-cube latency measurements.
- PE_DMA._handle_ipcq_inbound now pays Transaction.drain_ns at the top,
matching the terminal-drain behavior of ComponentBase._forward_txn for
every non-IPCQ Transaction. SRC-side tl.send stays fire-and-forget
(sender doesn't yield on sub_done); tl.recv now blocks until bytes
have actually drained into its inbox.
- ComponentContext: new compute_path_latency_ns helper + node_overhead_ns
field populated by GraphEngine.
- tests/test_kernel_launch_sync.py: asserts all PEs in one launch
produce identical pe_exec_ns for a no-op kernel (zero spread).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds tests/test_pe_to_pe_latency.py: a sweep that measures PE-to-PE
transfer latency for five hop types (intra-cube horizontal/vertical,
inter-cube horizontal/vertical, inter-SIP) across data sizes 128 B to
10 KB, on both the IPCQ (tl.send/tl.recv) and raw-DMA (tl.load+tl.store)
paths. Emits per-hop PNG plots, an overview PNG, and a CSV summary into
tests/pe2pe_latency_plots/. Latency is reported as max(pe_exec_ns) across
participating PEs, read from engine.get_completion(), so the measurement
captures the SRC/DST PE's kernel body time rather than the full launch+
response-aggregation envelope.
Two simulator fixes were needed to make this measurement meaningful:
- PeMMU now stores a list of (start, end, pa) sub-regions per page
rather than a single PA. DPPolicy layouts with shards smaller than
page_size (e.g. 128 B payloads with 4 KB pages) used to silently
overwrite each other through last-write-wins, causing DMAs intended
for cube0 to physically route to cube3 - inflating latency by ~170 ns
per DMA at small sizes. STOPGAP: real MMUs don't support sub-page
regions; long-term fix is either smaller MMU page size or DPPolicy
validation that refuses sub-page shards.
- M_CPU's per-PE metrics aggregation (pe_exec_ns, dma_ns, compute_ns)
now max-merges against the existing value in result_data rather than
overwriting. Multi-cube workloads share one result_data dict via
IO_CPU fanout; the previous overwrite caused whichever cube's M_CPU
finished last to clobber others' values, so multi-cube pe_exec_ns was
racy and frequently 0. Same fix applied in legacy/builtin/m_cpu.py.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2-rank bidirectional ring deadlock: when E and W neighbors point to the
same peer, sender-coord matching in _handle_meta_arrival / _credit_worker
picked the first direction in dict order, landing data in the wrong rx
slot relative to what the kernel recv(W) was waiting on.
Fix (ADR-0025 D1/D2/D3):
- install.reverse_direction: prefer OPPOSITE direction (E↔W, N↔S) when
peer has it pointing back to us; fallback to any matching for
topologies without opposite convention (tree_binary parent/child).
- _handle_meta_arrival: match by token.dst_addr range against each qp's
my_rx_base_pa + n_slots × slot_size window (unambiguous).
- _credit_worker: match by credit.dst_rx_base_pa == qp.peer.rx_base_pa.
- IpcqCreditMetadata: new dst_rx_base_pa field carrying receiver-side
rx base; _delayed_credit_send fills it from the consuming qp.
Tests (Phase 1 → Phase 2):
- test_reverse_direction_opposite_preference_2rank_ring
- test_reverse_direction_opposite_preference_4rank_ring_sanity
- test_meta_arrival_matches_by_dst_addr_same_peer
- test_credit_matches_by_dst_rx_base_pa_same_peer
- Existing credit-return test updated with dst_rx_base_pa.
508 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: In ring all-reduce, PE_IPCQ's recv handler advances my_tail
and issues a credit return immediately. With tight credit latency
(0.12ns intra-cube), the sender can refill the slot BEFORE the
receiver's outbound PE_DMA reads from it for the next send. The
outbound snapshot then captures stale data from a later round.
Fix: Propagate TensorHandle.data (captured at recv-time, before credit
return) through the entire send chain:
tl.send(src=handle) → IpcqSendCmd.data → IpcqDmaToken.data
PE_DMA outbound already prefers token.data over MemoryStore read, so
the recv-time snapshot is used for the in-flight data. This eliminates
the race: the snapshot is captured before the slot can be overwritten.
Additional fixes:
- PE_MATH handle_command: compute SIMD latency from output tensor
element count via _compute_ns(), using max(overhead_ns, compute_ns).
Previously used overhead_ns=0.0 for all standalone MathCmd, making
math ops take 0ns in SimPy.
- DataExecutor secondary sort: same-t_start ops sorted by op_kind
(memory < gemm < math) so IPCQ slot writes execute before math reads.
- ipcq_copy recorded at INBOUND time (receiver PE_DMA arrival) instead
of outbound. Inbound time is after fabric propagation, so it sorts
correctly relative to the receiver's math.
- record_copy accepts explicit snapshot parameter (from token.data).
Result: N_ELEM=32 + 256-rank + n_slots=4 + cross-SIP now passes.
n_slots reverted to 4 (the deeper buffer was a workaround, not needed).
502 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tl.program_id(axis=0) returns local PE id within cube,
tl.program_id(axis=1) returns cube id. Enables cube-aware
sharding in benchmark kernels.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove xbar_top/bot, bridge, single noc node from topology
- Each cube_mesh.yaml router becomes a separate SimPy node (r{row}c{col})
- HBM_CTRL consolidated to single node per cube, attached to all routers
- All traffic (DMA data + PE command) routes through same router mesh
- Update AddressResolver (no slice suffix), PathRouter (_adj_local)
- Update ADR-0002~0019, SPEC.md to remove xbar/bridge references
- Regenerate SVG diagrams for new topology structure
- Skip cross-SIP PE_TCM and PE_MMU routing tests (not yet wired)
326 passed, 13 skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add cycle-accurate PE accelerator scheduler (SchedulerV2) with tiled
GEMM/Math pipelines (DMA_IN → GEMM → MATH → DMA_WB)
- Add DPPolicy num_pes/num_cubes/num_sips overrides for single-PE testing
- Support tuple target_pe for targeting specific PE subsets
- Add gemm_single_pe and gpt3_qkv benchmarks
- Switch default topology to pe_scheduler_v2
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Model fabric response hop latency for PE-internal operations:
- HBM_CTRL sends PeDmaMsg response on reverse path instead of direct done signal
- PE_CPU sends ResponseMsg via NOC→M_CPU on kernel completion
- Add NOC→PE_DMA and PE_CPU→NOC edges in topology builder
- Make HBM BW test assertions dynamic based on topology efficiency
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>