PE-to-PE latency test + supporting fixes
Adds tests/test_pe_to_pe_latency.py: a sweep that measures PE-to-PE transfer latency for five hop types (intra-cube horizontal/vertical, inter-cube horizontal/vertical, inter-SIP) across data sizes 128 B to 10 KB, on both the IPCQ (tl.send/tl.recv) and raw-DMA (tl.load+tl.store) paths. Emits per-hop PNG plots, an overview PNG, and a CSV summary into tests/pe2pe_latency_plots/.

Latency is reported as max(pe_exec_ns) across participating PEs, read from engine.get_completion(), so the measurement captures the SRC/DST PE's kernel body time rather than the full launch + response-aggregation envelope.

Two simulator fixes were needed to make this measurement meaningful:

- PeMMU now stores a list of (start, end, pa) sub-regions per page rather than a single PA. DPPolicy layouts with shards smaller than page_size (e.g. 128 B payloads with 4 KB pages) used to silently overwrite each other through last-write-wins, causing DMAs intended for cube0 to physically route to cube3 - inflating latency by ~170 ns per DMA at small sizes. STOPGAP: real MMUs don't support sub-page regions; long-term fix is either smaller MMU page size or DPPolicy validation that refuses sub-page shards.
- M_CPU's per-PE metrics aggregation (pe_exec_ns, dma_ns, compute_ns) now max-merges against the existing value in result_data rather than overwriting. Multi-cube workloads share one result_data dict via IO_CPU fanout; the previous overwrite caused whichever cube's M_CPU finished last to clobber others' values, so multi-cube pe_exec_ns was racy and frequently 0. Same fix applied in legacy/builtin/m_cpu.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
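To make the PeMMU change easier to picture, here is a minimal sketch of a page table that keeps a list of (start, end, pa) sub-regions per page. It is not the simulator's code: the class name SubPageTable, its map/translate methods, and the example addresses are all hypothetical, and the sketch assumes each mapped range stays within one page.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # assumed 4 KB pages, matching the example in the message


class SubPageTable:
    """Hypothetical stand-in for a per-page sub-region table (not the real PeMMU)."""

    def __init__(self, page_size=PAGE_SIZE):
        self.page_size = page_size
        # virtual page number -> list of (start, end, pa); start/end are
        # byte offsets inside the page, end is exclusive
        self._pages = defaultdict(list)

    def map(self, va, size, pa):
        """Map [va, va + size) to physical address pa."""
        vpn, start = divmod(va, self.page_size)
        end = start + size
        if end > self.page_size:
            raise ValueError("sketch only handles ranges within one page")
        self._pages[vpn].append((start, end, pa))

    def translate(self, va):
        """Return the physical address for va, or raise KeyError if unmapped."""
        vpn, offset = divmod(va, self.page_size)
        for start, end, pa in self._pages.get(vpn, ()):
            if start <= offset < end:
                return pa + (offset - start)
        raise KeyError(hex(va))


# Two 128 B shards in the same 4 KB virtual page, backed by different
# (made-up) physical bases standing in for two different cubes.
table = SubPageTable()
table.map(0x1000, 128, 0xA000)
table.map(0x1080, 128, 0xB000)

first_shard_pa = table.translate(0x1000)
second_shard_pa = table.translate(0x1084)
```

With a single PA per page, the second map call would have replaced the first, and both shards would resolve to the 0xB000 base. Keeping sub-regions lets each shard resolve to its own target.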
@@ -204,16 +204,21 @@ class MCpuComponent(ComponentBase):
         yield all_done
         del self._parent_txns[request.request_id]
 
-        # Aggregate PE-internal metrics (max across PEs)
+        # Aggregate PE-internal metrics (max across PEs and across cubes).
+        # Multiple M_CPUs share the same result_data dict via IO_CPU fanout;
+        # merge against the existing value so cubes don't clobber each other.
         pe_exec_values = [st.result_data.get("pe_exec_ns", 0.0) for st in sub_txns]
         if pe_exec_values:
-            txn.result_data["pe_exec_ns"] = max(pe_exec_values)
+            cur = txn.result_data.get("pe_exec_ns", 0.0) or 0.0
+            txn.result_data["pe_exec_ns"] = max(cur, max(pe_exec_values))
         dma_values = [st.result_data.get("dma_ns", 0.0) for st in sub_txns]
         if dma_values:
-            txn.result_data["dma_ns"] = max(dma_values)
+            cur = txn.result_data.get("dma_ns", 0.0) or 0.0
+            txn.result_data["dma_ns"] = max(cur, max(dma_values))
         compute_values = [st.result_data.get("compute_ns", 0.0) for st in sub_txns]
         if compute_values:
-            txn.result_data["compute_ns"] = max(compute_values)
+            cur = txn.result_data.get("compute_ns", 0.0) or 0.0
+            txn.result_data["compute_ns"] = max(cur, max(compute_values))
 
         # Send aggregate response on reverse command path back to IO_CPU
         reverse_path = list(reversed(txn.path))
@@ -204,16 +204,21 @@ class MCpuComponent(ComponentBase):
         yield all_done
         del self._parent_txns[request.request_id]
 
-        # Aggregate PE-internal metrics (max across PEs)
+        # Aggregate PE-internal metrics (max across PEs and across cubes).
+        # Multiple M_CPUs share the same result_data dict via IO_CPU fanout;
+        # merge against the existing value so cubes don't clobber each other.
         pe_exec_values = [st.result_data.get("pe_exec_ns", 0.0) for st in sub_txns]
         if pe_exec_values:
-            txn.result_data["pe_exec_ns"] = max(pe_exec_values)
+            cur = txn.result_data.get("pe_exec_ns", 0.0) or 0.0
+            txn.result_data["pe_exec_ns"] = max(cur, max(pe_exec_values))
         dma_values = [st.result_data.get("dma_ns", 0.0) for st in sub_txns]
         if dma_values:
-            txn.result_data["dma_ns"] = max(dma_values)
+            cur = txn.result_data.get("dma_ns", 0.0) or 0.0
+            txn.result_data["dma_ns"] = max(cur, max(dma_values))
         compute_values = [st.result_data.get("compute_ns", 0.0) for st in sub_txns]
         if compute_values:
-            txn.result_data["compute_ns"] = max(compute_values)
+            cur = txn.result_data.get("compute_ns", 0.0) or 0.0
+            txn.result_data["compute_ns"] = max(cur, max(compute_values))
 
         # Send aggregate response on reverse command path back to IO_CPU
         reverse_path = list(reversed(txn.path))
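The max-merge in the hunks above can be illustrated in isolation. The following is a rough sketch, not the simulator's code: merge_max, METRIC_KEYS, and the sample numbers are invented for illustration, and plain dicts stand in for txn.result_data and the sub-transaction results.

```python
METRIC_KEYS = ("pe_exec_ns", "dma_ns", "compute_ns")


def merge_max(shared, sub_results, keys=METRIC_KEYS):
    """Fold per-PE metrics into a shared dict, keeping the larger value.

    `shared` plays the role of the result_data dict that several M_CPUs
    write into; `sub_results` is one cube's list of per-PE result dicts.
    """
    for key in keys:
        values = [r.get(key, 0.0) for r in sub_results]
        if values:
            # `or 0.0` also covers a key that is present but set to None
            cur = shared.get(key, 0.0) or 0.0
            shared[key] = max(cur, max(values))
    return shared


# Made-up numbers: cube A did the work, cube B was idle and finishes last.
shared = {}
merge_max(shared, [{"pe_exec_ns": 410.0, "dma_ns": 120.0}])
merge_max(shared, [{"pe_exec_ns": 0.0}])
```

With a plain overwrite, the second call would leave pe_exec_ns at 0.0. Because max is commutative, the merged value no longer depends on which cube finishes last.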