5 Commits

Author SHA1 Message Date
mukesh 14d800b0ae Kernel-launch sync (ADR-0009 D5) and IPCQ drain at inbound (ADR-0023)
- KernelLaunchMsg gains target_start_ns: IO_CPU stamps a global barrier
  (max path latency across every target PE), M_CPU passes it through,
  PE_CPU yields until that time before recording pe_exec_start. Every PE
  in a launch begins kernel execution at the same env.now regardless of
  its dispatch path length, which eliminates the per-PE dispatch-offset
  artifact in cross-PE and cross-cube latency measurements.

- PE_DMA._handle_ipcq_inbound now pays Transaction.drain_ns at the top,
  matching the terminal-drain behavior of ComponentBase._forward_txn for
  every non-IPCQ Transaction. SRC-side tl.send stays fire-and-forget
  (sender doesn't yield on sub_done); tl.recv now blocks until bytes
  have actually drained into its inbox.

- ComponentContext: new compute_path_latency_ns helper + node_overhead_ns
  field populated by GraphEngine.

- tests/test_kernel_launch_sync.py: asserts all PEs in one launch
  produce identical pe_exec_ns for a no-op kernel (zero spread).
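
The barrier arithmetic in the first bullet can be sketched in plain Python.
This is an illustration under stated assumptions, not the simulator's code:
it assumes the barrier is computed as the stamping time plus the slowest
path latency, and the PE names, latencies, and helper functions are
hypothetical. The real PE_CPU yields inside SimPy; here the wait is modeled
as a max().

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelLaunchMsg:
    """Toy stand-in for the real message; only the fields this sketch needs."""
    target_pe: str
    target_start_ns: float


def compute_target_start_ns(now_ns, path_latency_ns):
    """Assumed barrier rule: stamping time plus the slowest dispatch path."""
    return now_ns + max(path_latency_ns.values())


def pe_exec_start_ns(arrival_ns, msg):
    """A PE that arrives early waits for the barrier; it never starts before it."""
    return max(arrival_ns, msg.target_start_ns)


# Hypothetical per-PE dispatch latencies (ns) from IO_CPU to each target PE.
path_latency_ns = {
    "sip0.cube0.pe0": 40.0,
    "sip0.cube0.pe3": 55.0,
    "sip0.cube3.pe0": 210.0,
}

now_ns = 1000.0
barrier_ns = compute_target_start_ns(now_ns, path_latency_ns)

# Each PE receives the message after its own path latency, then waits.
starts = {
    pe: pe_exec_start_ns(now_ns + latency, KernelLaunchMsg(pe, barrier_ns))
    for pe, latency in path_latency_ns.items()
}
spread_ns = max(starts.values()) - min(starts.values())
print(barrier_ns, spread_ns)
```

With these numbers every PE starts at the barrier, so the spread is zero,
which is the property tests/test_kernel_launch_sync.py is described as
asserting.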

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 15:30:29 -07:00
mukesh 6918e6e906 PE-to-PE latency test + supporting fixes
Adds tests/test_pe_to_pe_latency.py: a sweep that measures PE-to-PE
transfer latency for five hop types (intra-cube horizontal/vertical,
inter-cube horizontal/vertical, inter-SIP) across data sizes 128 B to
10 KB, on both the IPCQ (tl.send/tl.recv) and raw-DMA (tl.load+tl.store)
paths. Emits per-hop PNG plots, an overview PNG, and a CSV summary into
tests/pe2pe_latency_plots/. Latency is reported as max(pe_exec_ns) across
participating PEs, read from engine.get_completion(), so the measurement
captures the SRC/DST PE's kernel body time rather than the full launch+
response-aggregation envelope.

Two simulator fixes were needed to make this measurement meaningful:

- PeMMU now stores a list of (start, end, pa) sub-regions per page
  rather than a single PA. DPPolicy layouts with shards smaller than
  page_size (e.g. 128 B payloads with 4 KB pages) used to silently
  overwrite each other through last-write-wins, causing DMAs intended
  for cube0 to physically route to cube3, which inflated latency by
  ~170 ns per DMA at small sizes. STOPGAP: real MMUs don't support sub-page
  regions; long-term fix is either smaller MMU page size or DPPolicy
  validation that refuses sub-page shards.

- M_CPU's per-PE metrics aggregation (pe_exec_ns, dma_ns, compute_ns)
  now max-merges against the existing value in result_data rather than
  overwriting. Multi-cube workloads share one result_data dict via
  IO_CPU fanout; the previous overwrite caused whichever cube's M_CPU
  finished last to clobber others' values, so multi-cube pe_exec_ns was
  racy and frequently 0. Same fix applied in legacy/builtin/m_cpu.py.
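
The two fixes above can be sketched as follows. This is a minimal
illustration, not the PeMMU or M_CPU implementation: the class name,
function name, addresses, and metric values are hypothetical, and the
sketch assumes a mapping never crosses a page boundary.

```python
PAGE_SIZE = 4096  # 4 KB pages, as in the example in the commit message


class SubRegionPageTable:
    """Toy page table keeping a list of (start, end, pa) entries per page."""

    def __init__(self, page_size=PAGE_SIZE):
        self.page_size = page_size
        self._pages = {}  # page number -> list of (start, end, pa)

    def map(self, va, size, pa):
        """Map [va, va + size) to pa; appends instead of overwriting the page."""
        page, start = divmod(va, self.page_size)
        end = start + size
        if end > self.page_size:
            raise ValueError("mapping crosses a page boundary")
        self._pages.setdefault(page, []).append((start, end, pa))

    def translate(self, va):
        """Return the PA for va by scanning the sub-regions of its page."""
        page, offset = divmod(va, self.page_size)
        for start, end, pa in self._pages.get(page, []):
            if start <= offset < end:
                return pa + (offset - start)
        raise KeyError(f"unmapped VA {va:#x}")


def max_merge_metrics(result_data, pe_metrics):
    """Merge per-PE metrics into a shared dict, keeping the larger value."""
    for key in ("pe_exec_ns", "dma_ns", "compute_ns"):
        result_data[key] = max(result_data.get(key, 0), pe_metrics.get(key, 0))
    return result_data


# Two 128 B shards share one 4 KB page but live on different cubes.
CUBE0_PA = 0x0000_0000  # hypothetical physical base addresses
CUBE3_PA = 0x3000_0000
mmu = SubRegionPageTable()
mmu.map(0x1000, 128, CUBE0_PA)
mmu.map(0x1080, 128, CUBE3_PA)
pa_first = mmu.translate(0x1004)   # resolves inside the cube0 shard
pa_second = mmu.translate(0x1088)  # resolves inside the cube3 shard

# Two M_CPUs report into the same result_data; the later, smaller report
# no longer clobbers the earlier one.
shared = {}
max_merge_metrics(shared, {"pe_exec_ns": 420, "dma_ns": 300, "compute_ns": 120})
max_merge_metrics(shared, {"pe_exec_ns": 0, "dma_ns": 310, "compute_ns": 0})
print(hex(pa_first), hex(pa_second), shared)
```

With a single PA per page, the second map() call would have replaced the
first, which is the last-write-wins behavior described above.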

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 21:04:31 -07:00
ywkang 5917b3497c Replace xbar/bridge/single-NOC with explicit router mesh (ADR-0019)
- Remove xbar_top/bot, bridge, single noc node from topology
- Each cube_mesh.yaml router becomes a separate SimPy node (r{row}c{col})
- HBM_CTRL consolidated to single node per cube, attached to all routers
- All traffic (DMA data + PE command) routes through same router mesh
- Update AddressResolver (no slice suffix), PathRouter (_adj_local)
- Update ADR-0002 through ADR-0019 and SPEC.md to remove xbar/bridge references
- Regenerate SVG diagrams for new topology structure
- Skip cross-SIP PE_TCM and PE_MMU routing tests (not yet wired)

326 passed, 13 skipped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 17:51:28 -07:00
ywkang 114510d4b9 Add SchedulerV2 (pe_accel), DPPolicy overrides, and new benchmarks
- Add cycle-accurate PE accelerator scheduler (SchedulerV2) with tiled
  GEMM/Math pipelines (DMA_IN → GEMM → MATH → DMA_WB)
- Add DPPolicy num_pes/num_cubes/num_sips overrides for single-PE testing
- Support tuple target_pe for targeting specific PE subsets
- Add gemm_single_pe and gpt3_qkv benchmarks
- Switch default topology to pe_scheduler_v2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 23:18:49 -07:00
ywkang 63669f82cb Add SIP-level tensor parallelism, component registry YAML, VA offset verification
- DPPolicy: 3-level (sip/cube/pe), unified naming (column_wise/row_wise)
- PE_CPU: auto num_programs from cube shard count
- context.launch(): per-SIP KernelLaunchMsg with local va_base + auto local shape
- deploy_tensor: removed mmus param, MMU mapping is context-only responsibility
- ComponentRegistry: YAML-based lazy loading (components.yaml), impls→builtin rename
- VA offset bench + tests: 2D/1D, standard Triton kernel pattern

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 01:13:17 -07:00