Commit Graph

64 Commits

Author SHA1 Message Date
mukesh 19dfc86dc3 Allreduce latency sweep across topologies and data sizes
Adds test_allreduce_latency_sweep that runs the existing intercube
allreduce kernel under three SIP topologies (ring_1d, torus_2d,
mesh_2d_no_wrap, all at n_sips=4) across 11 data sizes from 256 B/SIP
up to 1 MB/SIP. For each point, captures max(pe_exec_ns) — the
critical-path kernel time — and emits CSV plus log-x and linear-x
plots, both per-topology and combined overview, with KB/MB-formatted
tick labels. Reuses run_allreduce + _write_temp_configs and adds a
slot_size auto-bump when n_elem*2 exceeds the default IPCQ slot.

Sweep skips n_elem=16 because the runtime's dim_map scalar-arg
remapping (context.py:761) collides any int-valued kernel scalar that
matches a global tensor dim with its local shard size.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 10:16:29 -07:00
mukesh 14d800b0ae Kernel-launch sync (ADR-0009 D5) and IPCQ drain at inbound (ADR-0023)
- KernelLaunchMsg gains target_start_ns: IO_CPU stamps a global barrier
  (max path latency across every target PE), M_CPU passes it through,
  PE_CPU yields until it before recording pe_exec_start. Every PE in a
  launch begins kernel execution at the same env.now regardless of its
  dispatch path length — eliminates per-PE dispatch-offset artifact in
  cross-PE and cross-cube latency measurements.

- PE_DMA._handle_ipcq_inbound now pays Transaction.drain_ns at the top,
  matching the terminal-drain behavior of ComponentBase._forward_txn for
  every non-IPCQ Transaction. SRC-side tl.send stays fire-and-forget
  (sender doesn't yield on sub_done); tl.recv now blocks until bytes
  have actually drained into its inbox.

- ComponentContext: new compute_path_latency_ns helper + node_overhead_ns
  field populated by GraphEngine.

- tests/test_kernel_launch_sync.py: asserts all PEs in one launch
  produce identical pe_exec_ns for a no-op kernel (zero spread).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 15:30:29 -07:00
mukesh 6918e6e906 PE-to-PE latency test + supporting fixes
Adds tests/test_pe_to_pe_latency.py: a sweep that measures PE-to-PE
transfer latency for five hop types (intra-cube horizontal/vertical,
inter-cube horizontal/vertical, inter-SIP) across data sizes 128 B to
10 KB, on both the IPCQ (tl.send/tl.recv) and raw-DMA (tl.load+tl.store)
paths. Emits per-hop PNG plots, an overview PNG, and a CSV summary into
tests/pe2pe_latency_plots/. Latency is reported as max(pe_exec_ns) across
participating PEs, read from engine.get_completion(), so the measurement
captures the SRC/DST PE's kernel body time rather than the full launch+
response-aggregation envelope.

Two simulator fixes were needed to make this measurement meaningful:

- PeMMU now stores a list of (start, end, pa) sub-regions per page
  rather than a single PA. DPPolicy layouts with shards smaller than
  page_size (e.g. 128 B payloads with 4 KB pages) used to silently
  overwrite each other through last-write-wins, causing DMAs intended
  for cube0 to physically route to cube3 - inflating latency by ~170 ns
  per DMA at small sizes. STOPGAP: real MMUs don't support sub-page
  regions; long-term fix is either smaller MMU page size or DPPolicy
  validation that refuses sub-page shards.

- M_CPU's per-PE metrics aggregation (pe_exec_ns, dma_ns, compute_ns)
  now max-merges against the existing value in result_data rather than
  overwriting. Multi-cube workloads share one result_data dict via
  IO_CPU fanout; the previous overwrite caused whichever cube's M_CPU
  finished last to clobber others' values, so multi-cube pe_exec_ns was
  racy and frequently 0. Same fix applied in legacy/builtin/m_cpu.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 21:04:31 -07:00
mukesh 1d8b9401e5 Intercube allreduce: pe0 cube-mesh reduce + multi-SIP ring/torus/mesh
New intercube allreduce kernel replacing the old flat ring algorithms.
Reduces across the 4x4 cube mesh within each SIP (pe0-only, same-lane),
then inter-SIP exchange on root cube, then broadcast back. Supports
ring_1d, torus_2d, and mesh_2d_no_wrap SIP topologies driven by
topology.yaml. Integrated with dist.init_process_group / dist.all_reduce.

New files:
- src/kernbench/ccl/algorithms/intercube_allreduce.py (kernel)
- src/kernbench/ccl/sfr_config.py (configure_sfr_intercube_multisip)
- tests/test_allreduce_multidevice.py (config-driven, 3 topologies)
- tests/test_distributed_intercube_allreduce.py (full distributed path)
- tests/test_intercube_sfr_config.py (SFR wiring verification)

Modified:
- distributed.py: AhbmCCLBackend uses configure_sfr_intercube_multisip
- topologies.py: added torus_2d, mesh_2d_no_wrap
- install.py: global_E/W/N/S in _OPPOSITE_DIR
- topology.yaml: added system.sips.topology
- ccl.yaml: single intercube_allreduce algorithm
- benches/ccl_allreduce.py: row_wise cube-mesh tensor layout

Removed old flat-ring algorithms and their tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 17:33:42 -07:00
ywkang cfc2d74ec4 Refactor ccl_allreduce bench: rank=SIP only, remove rank=PE legacy path
The unified ccl_allreduce bench previously carried two execution models
in one worker with ``if world_size == n_sips:`` branching:
  - TP mode (rank = SIP, ADR-0024/0027): proper ProcessGroup semantics.
  - Legacy rank = PE mode: single-driver worker allocating one big tensor
    distributed across all PEs via _derive_dp, with kernel-level SPMD via
    program_id.

The second model is unnecessary — intra-SIP PE-level collectives are
expressed inside the kernel (tl.send/tl.recv with program_id, IPCQ) and
do not need a host-side ProcessGroup. Removing it lets the bench be a
clean reference implementation of the TP launcher.

benches/ccl_allreduce.py:
- Config resolved once in run() via _resolve_cfg -> _BenchCfg dataclass.
- rank != n_sips now raises RuntimeError explicitly.
- _worker / _allocate_rank_tile / _init_with_rank_value / _report each
  have one concern; duplicated init + verification paths collapsed.
- _derive_dp and the second verify+print block deleted.
- 166 lines -> 91 lines.

ccl.yaml:
- mesh_allreduce_4 (world_size: 4) and tree_allreduce_7 (world_size: 7)
  algorithm entries removed (rank = PE only).
- Algorithm kernel files (kernbench.ccl.algorithms.mesh_allreduce,
  tree_allreduce) kept as-is for direct-dispatch future use.

tests/test_ccl_allreduce_matrix.py:
- Matrix shrinks from 7 cases to 3: ring × {tcm, hbm, sram} at ws =
  topology SIP count (= 2). mesh_2x2, tree_binary_7, ring_multi_cube,
  and the three ring_*_8 cases removed.

tests/test_ccl_performance.py:
- _run_8rank renamed to _run_ring; world_size: 8 override dropped; now
  exercises rank = SIP ring all-reduce.

tests/test_mp_spawn.py, tests/test_ccl_ddp_launcher.py:
- Monkeypatch target updated from bench.worker to bench._worker
  (signature now takes BenchCfg instead of (rank, world_size)).

555 passed, 1 intentional skip. Tests that directly call
install_ipcq(world_size_override=N) for kernel-level sanity
(test_ccl_hello_world_guide, test_recv_copy_to_dst, test_tl_recv_async,
test_ccl_deadlock_detection) are unchanged — they never went through
the bench and still exercise the kernel-only path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 16:45:27 -07:00
ywkang 105f1dc09e ADR-0027: Megatron TP API + worker-wait generalization + mp.spawn
Implements ADR-0027 Phase 2 end-to-end. All 559 tests pass (was 523 +
1 xfail; ring_default_ws strict-xfail is now resolved).

D0 — Worker-wait generalization (context.py):
- _pending_worker_waits queue on RuntimeContext.
- ctx.wait(h) in worker context defers to main via g.parent.switch().
  Fast-path for already-completed handles.
- Worker API is unchanged: tensor deploy, launch, etc. still look
  synchronous; they're transparently cooperatively scheduled.
- Solves ADR-0024 Phase B kernel-greenlet orphan bug (env.run now
  only ever drives from main; kernel _parent is always main).

D0.5 — Host-read barrier (tensor.py):
- Explicit _HOST_READ_BARRIERS registry (T5.g closed-set via code
  review, not reflection-magic).
- numpy/data/__getitem__/__repr__ drain pending worker-waits before
  host-observable read.
- copy_: source-side barrier via source.numpy(). Target-side write
  barrier is intentionally NOT applied — global pending target barrier
  prematurely drains cross-rank collectives → deadlock.
- Collective pending is excluded from barrier drain condition
  (collective is cross-rank; its own yield in all_reduce covers the
  invariant naturally).

D1 — torch.multiprocessing.spawn (runtime_api/multiprocessing.py):
- API signature parity with real PyTorch spawn; execution is
  cooperative greenlet scheduler (process isolation etc. are explicit
  non-goals per D1.0).
- _drain_pending drains worker-waits then collectives in one barrier,
  loop-until-empty.
- Round-based exception handling with SystemExit sibling abort +
  SpawnException(errors) wrapping root-cause ranks.
- RuntimeContext attaches ctx.multiprocessing in __post_init__.
- benches/ccl_allreduce.py hand-rolled loop collapses to one
  torch.multiprocessing.spawn call.

D2–D6 — kernbench.tp package:
- parallel_state: initialize_model_parallel, get_*_rank,
  get_*_world_size, with weak active-ctx registry in context.py.
- layers: ColumnParallelLinear, RowParallelLinear (shape-only
  primitives — fp16 gemm via tl.load + tl.dot + tl.store).
- kernels: _gemm_kernel used by TP layers (self-contained; no bench
  dependency).
- primitives / mappings stubs per D6/D8.

Data-path fixes (surfaced by TP gemm + all_reduce sequence):
- sim_engine/op_log.py: dma_write snapshot is skipped for TCM
  sources (PE scratch is repopulated by Phase 2 math/gemm replay —
  capturing Phase-1-time snapshot picked up STALE data from prior
  kernel's output aliased at the same scratch addr, causing the later
  kernel's dma_write to overwrite Phase 2 result with stale value).
- sim_engine/op_log.py + sim_engine/data_executor.py: per-operand
  space recorded on GemmCmd and composite gemm records so HBM-resident
  operands (tl.load output) don't default to TCM during replay.
- runtime_api/context.py: ctx.zeros writes zero-init to MemoryStore
  at VA keys so kernels reading via VA see deterministic init even
  without explicit copy_().

Tests (Phase 1 + Phase 2):
- test_worker_wait_drain (T3): orphan invariant + resume + multi-rank
  drain + idempotency + exception propagation.
- test_mp_spawn (T4): spawn shape + bind + SpawnException scope.
- test_host_read_barrier (T5): barrier contract per entry-point +
  closed-set registry check.
- test_tp_parallel_state (T1): initialize + rank lookup.
- test_tp_layers (T2): shape + deterministic numerical correctness
  (concat-matmul equality for RowParallel, not mean-only).
- test_tp_mlp (T6): full 2-layer MLP with deterministic weight
  numerical match + rank-consistency post all-reduce.
- test_ccl_allreduce_matrix: ring_default_ws xfail removed (T7).

Regression: 523 pre + 35 new + 1 ex-xfail = 559 passed, 1 intentional
skip (T3.e historical failure documentation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 16:31:13 -07:00
ywkang e7f376ebaa ADR-0027 rev7 (Megatron TP + worker-wait generalization) + ADR-0026 typo fix
ADR-0027 is a design-only change (no production code). Rev 7 closes design
across 7 iterations of review. Key decisions:

- D0 (worker-wait generalization): ctx.wait in worker context yields to
  main scheduler, which drains env.run. Solves ADR-0024 Phase B orphan
  bug (ring_default_ws strict xfail). Normative contracts on resume
  invariant, fast-path, main-context non-reentrance, barrier
  loop-until-empty, and scheduler non-progress as user contract.
- D0.5 (host-read barrier): Tensor.numpy/data/__getitem__/__repr__/copy_
  auto-drain pending before reading. Closed-set via explicit registry
  (T5.g). copy_ uses global-pending barrier with explicit
  over-serialization tradeoff.
- D1 (torch.multiprocessing.spawn): real-PyTorch API-signature parity,
  cooperative greenlet scheduler internally. Explicit non-goal on
  process isolation / address space / failure isolation. Sibling
  cleanup via SystemExit + SpawnException(errors) wrapping root-cause
  ranks.
- D4/D5 (TP layers): ColumnParallelLinear / RowParallelLinear use
  torch.launch(gemm_kernel) — no host-side torch.matmul. Yield-safety
  contract normatively required for all TP forward paths.
- Supersedes ADR-0024 D7/D12/D13 as design (none landed). Source of
  truth declared normative.

Test strategy: T1-T8 with numerical-correctness primary (not mean/
aggregate-only), orphan invariant direct assertion, host-read barrier
closed-set via registry. Phase 2 acceptance = 524 passed + 0 xfail
(ring_default_ws unblocked by D0).

ADR-0026 typo fix: torch.cuda.set_device → torch.ahbm.set_device in
DPPolicy docstring (ADR-0024 D10 convention).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 14:13:26 -07:00
ywkang 357cab525b ADR-0026: DPPolicy intra-device only + ShardSpec structural coords
DPPolicy no longer carries a cross-SIP axis. SIP-level placement is
solely controlled by torch.ahbm.set_device(rank) (ADR-0024); DPPolicy
itself describes only the cube × PE layout within one SIP. ShardSpec
switches to structural (sip, cube, pe) coordinates; the flat pe_index
field/property is fully removed — silent drift between global-flat and
SIP-local interpretations was a foot-gun flagged by ADR-0024 D11.

Breaking API (explicit TypeError / AttributeError):
- DPPolicy(sip=...) / DPPolicy(num_sips=...) -> TypeError
- ShardSpec.pe_index -> AttributeError
- ShardSpec(pe_index=...) -> TypeError
- resolve_dp_policy now takes target_sip= (required), no num_sips.

Downstream migration:
- PE allocator dict keyed by (sip, cube, pe) tuples, in both
  _ensure_allocators and _free_tensor. deploy_tensor uses tuple lookup.
- _create_tensor passes target_sip=current_sip; post-hoc pe_index
  shifting removed entirely.
- launch._compute_local_shape drops the dp.sip branch.
- Internal resolvers (column_wise / row_wise / replicate / tiled_*)
  return _LocalPeShard (cube-local identifier) instead of ShardSpec —
  resolve_dp_policy lifts them to full structural coords.

Tests:
- New tests/test_adr0026_dppolicy_intra_device.py (12 tests) pins the
  contract end-to-end.
- test_sip_parallel.py rewritten: SIP composition now modeled as two
  resolve_dp_policy(target_sip=...) calls (ADR-0024 launcher style).
- Call-site migration: test_tensor, test_va_integration, test_va_offset,
  test_runtime_api_tensor, test_tl_recv_async, test_ccl_* and benches
  gemm_single_pe, gpt3_qkv, va_offset_verify, ccl_allreduce (legacy
  branch) all use intra-device DPPolicy and structural ShardSpec.

Result: 523 passed, 1 strict xfail (ring_default_ws — unchanged
ADR-0024 Phase B blocker; architectural fix deferred to ADR-0027).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 13:02:19 -07:00
ywkang 787409ced1 ADR-0024 Phase B: update xfail reason with architectural blocker details
Phase B Option A (freeze + defer to ADR-0027): the root cause of
ring_default_ws strict-xfail is that bench workers call torch.zeros /
copy_ which drive env.run in the WORKER-greenlet context. Any pending
KernelLaunchMsg gets stepped inside that worker, spawning kernel_runner
with parent = worker (not main). When the worker yields/finishes, the
kernel greenlet is orphaned and its next switch_to_simpy raises
GreenletExit mid-add — producing rank 0 mean=1 (expected 3).

This is a larger architectural redesign (lazy-deploy tensor API,
coroutine worker, or setup/verify split) and is parked until ADR-0027
(Megatron TP) starts, where the proper solution ships with TP use cases.

No production changes; xfail reason + inline comment only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 12:46:33 -07:00
ywkang 79124daab1 ADR-0024 Phase B (partial): scheduler-level collective drain
Root cause (hang diagnosis):
`kernel_runner.run()` captures `greenlet.getcurrent()` at spawn time as
the kernel greenlet's `_parent`. When a worker greenlet (say g0) calls
`dist.all_reduce` → `ctx.wait(h)` → `env.run(until=h0)`, the SimPy
scheduler steps pe_cpu processes, which in turn spawn kernel greenlets.
Those kernels' `_parent` becomes g0 (current greenlet at spawn). When a
kernel yields via switch_to_simpy, control jumps back up to g0's LAST
switch point — which is the main scheduler's `g.switch()` call — rather
than the kernel_runner's generator frame. Main then re-enters its
`for g in alive: g.switch()` loop mid-wait, producing nested greenlet
re-entry. Scheduler spins: g0 never completes, g1 appears to complete
out of order, infinite loop at 100% CPU.

Fix:
- AhbmCCLBackend.all_reduce: in multi-greenlet mode, submit via
  launch(_defer_wait=True), extend backend._pending_collective_handles,
  and yield to the parent greenlet. Worker does NOT call wait.
- benches/ccl_allreduce.py run(): after each scheduler round, the MAIN
  greenlet drains backend._pending_collective_handles. This keeps
  env.run invocation in the main context, so kernel_runner's spawned
  kernel greenlets have main as their _parent — no nested re-entry.
- Legacy single-driver path (no bench scheduler): all_reduce falls back
  to inline wait when g.parent is None.

Result:
- Multi-greenlet cross-SIP ring no longer hangs (was 100% CPU infinite
  loop in kernel_runner._switch_kernel).
- ring_default_ws still xfail(strict=True): now fails as a data
  correctness issue — DataExecutor reports only 1 math op for a 2-rank
  ring (expected 2). Cross-SIP op_log replay integration is the
  remaining Phase B task.

514 passed, 1 xfailed (strict).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 09:14:03 -07:00
ywkang 4ba0a83e71 Implement ADR-0024 Phase A: SIP-level TP launcher MVP
Scope (Phase A):
- D1: world_size fallback = SIP count (rank = SIP, TP boundary)
- D9: greenlet-local get_rank + _bind_rank (single-driver fallback = 0)
- D10: torch.ahbm.set_device + torch.accelerator.set_device_index alias
- D11: tensor placement scoped to current-device SIP (post-hoc pe_index
  shift — ADR-0026 replaces with structural coords)
- D12/D13: multi-greenlet run() with simple round-robin scheduler;
  hybrid dispatch (ws == SIP count → multi-greenlet, else legacy
  single-worker for ccl.yaml override compat)
- D7 partial: backend.all_reduce submit + yield + wait via launch()'s
  new _defer_wait flag; parent-less greenlets skip yield
- Relaxed shard-count check (len(shards) > 0 instead of == world_size)
- rank_to_pe = SIP-representative [(r, 0, 0)] when ws <= n_sips

Deferred to Phase B:
- Engine-routed install (D2) — keeps sideband
- install_plan.py module (D6) — keeps install.py
- Epoch barrier (D7 full) — simple yield is sufficient for ring ws=2 mock
- Validator registry (D8)
- Cross-SIP multi-greenlet + real kernel integration — matrix
  ring_default_ws hangs in SimPy drain despite ADR-0025 direction fix;
  marked xfail(run=False) pending Phase B diagnosis (suspected per-rank
  kernel_args / program_id mismatch)

Tests:
- test_ccl_ddp_launcher.py (6 new tests) — D1/D9/D10/D11/D12/D13
- test_ccl_allreduce_matrix.py — ring_default_ws xfail'd, override
  cases (ring_tcm_8 / hbm_8 / sram_8 / multi_cube / mesh_2x2 /
  tree_binary_7) all pass via legacy path

514 tests pass, 1 xfail.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 09:00:28 -07:00
ywkang 32536daf2e Fix ADR-0025: IPCQ direction addressing via address-based matching
2-rank bidirectional ring deadlock: when E and W neighbors point to the
same peer, sender-coord matching in _handle_meta_arrival / _credit_worker
picked the first direction in dict order, landing data in the wrong rx
slot relative to what the kernel recv(W) was waiting on.

Fix (ADR-0025 D1/D2/D3):
- install.reverse_direction: prefer OPPOSITE direction (E↔W, N↔S) when
  peer has it pointing back to us; fallback to any matching for
  topologies without opposite convention (tree_binary parent/child).
- _handle_meta_arrival: match by token.dst_addr range against each qp's
  my_rx_base_pa + n_slots × slot_size window (unambiguous).
- _credit_worker: match by credit.dst_rx_base_pa == qp.peer.rx_base_pa.
- IpcqCreditMetadata: new dst_rx_base_pa field carrying receiver-side
  rx base; _delayed_credit_send fills it from the consuming qp.

Tests (Phase 1 → Phase 2):
- test_reverse_direction_opposite_preference_2rank_ring
- test_reverse_direction_opposite_preference_4rank_ring_sanity
- test_meta_arrival_matches_by_dst_addr_same_peer
- test_credit_matches_by_dst_rx_base_pa_same_peer
- Existing credit-return test updated with dst_rx_base_pa.

508 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 00:38:41 -07:00
ywkang e1084800ab docs: add ADRs 0024–0031 for SIP-TP launcher stack
ADR-0024 (SIP-level TP launcher): rank = SIP abstraction, engine-routed
  install, mp.spawn parity, epoch barrier, ShardSpec structural coords.
ADR-0025 (IPCQ direction addressing): address-based matching for meta
  arrival and credit return; fixes 2-rank bidirectional ring deadlock.
ADR-0026 (DPPolicy intra-device only): remove sip/num_sips fields;
  ShardSpec uses structural (sip, cube, pe); pe_index property removed.
ADR-0027 (Megatron-style TP API): ColumnParallelLinear / RowParallelLinear
  on top of ADR-0024 launcher. Backlog until 0024/0025/0026 land.
ADR-0028 (DTensor support): stub / future work.
ADR-0029 (Hierarchical all-reduce): 3-level reduce using all_pes mapper
  and multi_pe_sip_local validator from ADR-0024. Backlog.
ADR-0030 (IPCQ PhysAddr integration): blocked on ADR-0031.
ADR-0031 (PhysAddr PE-resource extension): stub; local_offset range-based
  partition approach; specific ranges TBD.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 00:38:27 -07:00
ywkang b2c52f0e34 Add English translations for ADR-0018, 0019, 0020, 0021
- ADR-0018: LA-based memory address abstraction + BAAW + HBM channel mapping
- ADR-0019: CUBE NOC per-channel and aggregated HBM connection model
- ADR-0020: 2-pass data execution model (timing/data separation, greenlet)
- ADR-0021: PE pipeline refactor (component separation + token self-routing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 16:31:32 -07:00
ywkang 10b33b44ba Add Tensor indexing + hierarchical 3-level all-reduce kernel
Tensor.__setitem__ / __getitem__:
- Shard-aligned slice assignment and read on deployed tensors.
- Scalar broadcast and numpy array assignment supported.
- Cross-shard slices raise NotImplementedError (use copy_ for that).
- 3 new tests: single-PE, multi-PE, cross-shard error case.

Hierarchical all-reduce kernel (src/kernbench/ccl/algorithms/):
- 3-level reduce: intra-cube (E/W) → inter-cube (N/S) → inter-SIP (parent).
- Bidirectional ring reduce at each level: ceil((N-1)/2) rounds.
  Left half sends via dir_dec, right half via dir_inc (wrap).
  Representative receives from both sides.
- Chain broadcast for reverse path: cube 0 PE 0 → all PE 0s → all PEs.
- Registered in ccl.yaml as "hierarchical_allreduce" with topology: none
  (neighbors() override builds the full 3-level neighbor map).
- kernel_args derives pes_per_cube/cubes_per_sip/num_sips from world_size.
- Mock-verified at 8/16/32/64/128 ranks.

Mock runtime fixes:
- Direction pairing: explicit N↔S, E↔W, parent↔parent instead of
  "first matching reverse". Fixes 2-element rings where N and S both
  point to the same peer.
- Deadlock detection: send-counter based (not just queue-depth-total)
  to catch chain reductions where send+recv pairs net to zero.
- Multi-cube program_id: pes_per_cube parameter enables
  program_id(axis=0) = PE within cube, program_id(axis=1) = cube id.
  Legacy single-cube tests unaffected (default = world_size).

504 tests pass in 12s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 23:52:04 -07:00
ywkang 1c8ddc2d03 Fix Phase 1 slot-overwrite race + PE_MATH latency model (n_slots=4 safe)
Root cause: In ring all-reduce, PE_IPCQ's recv handler advances my_tail
and issues a credit return immediately. With tight credit latency
(0.12ns intra-cube), the sender can refill the slot BEFORE the
receiver's outbound PE_DMA reads from it for the next send. The
outbound snapshot then captures stale data from a later round.

Fix: Propagate TensorHandle.data (captured at recv-time, before credit
return) through the entire send chain:
  tl.send(src=handle) → IpcqSendCmd.data → IpcqDmaToken.data
PE_DMA outbound already prefers token.data over MemoryStore read, so
the recv-time snapshot is used for the in-flight data. This eliminates
the race: the snapshot is captured before the slot can be overwritten.

Additional fixes:
- PE_MATH handle_command: compute SIMD latency from output tensor
  element count via _compute_ns(), using max(overhead_ns, compute_ns).
  Previously used overhead_ns=0.0 for all standalone MathCmd, making
  math ops take 0ns in SimPy.
- DataExecutor secondary sort: same-t_start ops sorted by op_kind
  (memory < gemm < math) so IPCQ slot writes execute before math reads.
- ipcq_copy recorded at INBOUND time (receiver PE_DMA arrival) instead
  of outbound. Inbound time is after fabric propagation, so it sorts
  correctly relative to the receiver's math.
- record_copy accepts explicit snapshot parameter (from token.data).

Result: N_ELEM=32 + 256-rank + n_slots=4 + cross-SIP now passes.
n_slots reverted to 4 (the deeper buffer was a workaround, not needed).
502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 23:02:19 -07:00
ywkang 74f5f5cf08 Add session-scoped topology fixture in tests/conftest.py
Provides a shared `topology` fixture that caches the parsed
topology.yaml result per pytest-xdist worker session. Tests that
build a GraphEngine can accept `topology` instead of calling
resolve_topology("topology.yaml") repeatedly.

Topology parsing costs ~32ms, so the practical saving per worker is
modest (<1s across all tests). The fixture is mainly for architectural
cleanliness — keeping the "parse once, build engine many" pattern
explicit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 21:13:25 -07:00
ywkang 372c987995 Reduce test time to 12s: shrink GEMM dims + enable pytest-xdist
GEMM dimension reduction:
- qkv_gemm.py: M,K,N = 128,256,128 → 32,64,32 (64 tiles → 1 tile).
- qkv_gemm_multi_pe.py: same reduction.
- Tests verify pipeline correctness, not large-matrix throughput.
- Per-test time: 18s → 1.7s. 6 tests total: 108s → 10s.

pytest-xdist parallel execution:
- Add pytest-xdist to dev dependencies.
- pyproject.toml addopts: -n auto (use all CPU cores), -m "not slow".
- Default `pytest` runs 501 tests in ~12s (previously 148s).
- Full suite including slow: `pytest -m ""` → 3m24s (previously 5m43s).

pytest.mark.slow:
- Registered in pyproject.toml markers section.
- 256-rank full-system test is the only slow-marked test.
- Run with: pytest -m "" (CI) or pytest (local dev, skips slow).

502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 21:06:41 -07:00
ywkang bcf941dcee Speed up regression: 25min → 6min (test matrix + DataExecutor cleanup)
Test matrix restructure:
- 256-rank full-system ring runs only ONCE (marked pytest.mark.slow)
  instead of 7× across matrix + perf tests. Cross-SIP routing is
  verified by the single run; buffer variants (tcm/hbm/sram) are
  tested at 8-rank where they finish in <0.5s.
- Performance tests use 8-rank instead of 256-rank.
- `pytest -m "not slow"` completes in ~2.5min (local dev).
- Full suite including slow: ~6min (CI).

DataExecutor optimization:
- Remove ThreadPoolExecutor from DataExecutor.run(). Same-t_start
  groups are almost always size 1, so the thread pool creation and
  dispatch overhead dominated. Simple sequential loop is faster.
- Skip dma_read ops at the loop level (they are always no-ops in
  Phase 2 but were dispatched through _execute_op → _execute_memory).
- Remove redundant CLI Phase 2 re-execution: engine._flush_data_phase
  already replays during engine.wait(); the CLI now only prints the
  diagnostic summary without re-running DataExecutor.

502 tests pass. Wall time: 25m30s → 5m43s (full), 2m28s (no slow).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 20:52:07 -07:00
ywkang 998cc85762 Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)
Major changes:

PE-level IPCQ infrastructure:
- New PE_IPCQ component: ring-buffer control plane with 4-direction
  neighbor mapping, head/tail pointers, backpressure (poll/sleep).
- PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA,
  including in-flight data snapshot (D9) and op_log recording at
  outbound time for Phase 2 replay correctness.
- IpcqDmaToken piggyback model: data + metadata travel together,
  atomic visibility at receiver (invariant I6).
- Credit return fast path: bottleneck-BW latency, no fabric vc_comm.

Phase 2 data execution (ADR-0020 integration):
- op_log extended: DmaWriteCmd now captures src_space/src_addr for
  Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time.
- DataExecutor replays dma_write + ipcq_copy in t_start order.
- Engine._flush_data_phase: incremental cursor-based replay after
  each engine.wait() so host reads see post-Phase-2 data.
- KernelRunner Phase 1 writes disabled when op_log is active to
  prevent stale data from corrupting the MemoryStore snapshot.

TLContext / kernel API:
- tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype),
  tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode.
- TensorHandle operator overloading (add/sub/mul/div) via thread-local
  active TLContext → MathCmd dispatch through PE_MATH.
- PE-local scratch allocator for math output handles.
- tl.load returns space="hbm" handles for correct Phase 2 addressing.
- Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv.

Unified ccl_allreduce bench (PyTorch-compat host code):
- Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch)
  split matching real PyTorch DDP worker pattern.
- torch.distributed facade: init_process_group, get_world_size,
  get_rank, get_backend, all_reduce, barrier — only real PyTorch names.
- AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches
  kernel via tensor shard metadata (n_elem from shards[0].nbytes).
- world_size derived from topology spec (sips × cubes × pes_per_cube)
  with optional algorithm-level override in ccl.yaml.

Tensor API (PyTorch-compat surface):
- Tensor.numpy(): gather-aware (all shards via VA-based addressing).
- Tensor.copy_(source): scatter from host tensor into sharded target.
- RuntimeContext.from_numpy(arr): host-side staging tensor.
- Tensor.data property fixed to use numpy() (was shards[0]-only).

Algorithm modules moved to src/kernbench/ccl/algorithms/:
- ring_allreduce, mesh_allreduce, tree_allreduce, hello_send.
- Each module exports kernel_args(world_size, n_elem) helper.
- ccl.yaml module paths updated to kernbench.ccl.algorithms.*.

Dead code removed:
- 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.).
- _run_ccl_bench greenlet-per-SIP scheduler.
- benches.loader.is_ccl_bench + run_rank detection.
- benches/ccl/ directory.

Tests:
- New test_ccl_allreduce_matrix.py: 7 parametrized cases
  (ring×3 buffers, ring 8/16, mesh 4, tree 7).
- New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests.
- Existing tests updated for new import paths + world_size_override.

Docs:
- Korean ccl-author-guide.md and ADR-0023 paths updated.
- New English versions: ccl-author-guide.en.md, ADR-0023.en.md.

502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 19:36:59 -07:00
ywkang ff2c677a9c Add 2D grid program_id semantics (ADR-0022)
tl.program_id(axis=0) returns local PE id within cube,
tl.program_id(axis=1) returns cube id. Enables cube-aware
sharding in benchmark kernels.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 16:49:56 -07:00
ywkang dc3fb02aed Add --verify-data CLI flag, Tensor.data property, parallel DataExecutor
- CLI: --verify-data flag enables Phase 2 data verification (ADR-0020)
- Tensor.data: returns actual numpy values (verify-data) or zeros placeholder
- Tensor.__repr__: shows value summary or data=N/A (placeholder)
- DataExecutor: ThreadPoolExecutor for same-timestamp parallel op execution
- BenchResult.engine: exposes op_log/memory_store for Phase 2 access

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 09:34:01 -07:00
ywkang 59e36f0c34 Add E2E pipeline tests: greenlet op_log, GEMM accuracy, latency regression
ADR-0020 + ADR-0021 final verification:
- CompositeCmd GEMM/Math pipeline completes through full chain
- Greenlet mode generates op_log records (memory + gemm ops)
- Phase 1→Phase 2: MemoryStore seed → greenlet → op_log → DataExecutor → allclose
- Latency determinism: same kernel produces identical latency
- Multi-tile > single-tile latency invariant

388 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 00:28:03 -07:00
ywkang 81ce55571d Rename impl names: add builtin. prefix for clear provenance
- components.yaml: all builtin impls use builtin.xxx naming
- topology.yaml: all impl references updated to builtin.xxx
- builder.py: hardcoded ucie impl → builtin.ucie
- Tests: all impl string references updated

Convention: builtin.<name> for built-in, custom.<name> for user-defined.
382 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 00:16:24 -07:00
ywkang 1d95df4bee Restructure legacy backups, remove pe_accel, fix DMA self-routing
- Move builtin_legacy/ → legacy/builtin/ (cleaner structure)
- Move pe_accel_legacy/ → legacy/pe_accel/
- Remove custom/pe_accel/ (replaced by new builtin)
- Remove pe_scheduler_v2 from components.yaml
- Switch topology.yaml to pe_scheduler_v1 (new builtin)
- Fix PE_DMA self-routing: handle consecutive DMA_READ stages
  (same component consecutive stages processed in-place, not via port)

382 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 00:02:26 -07:00
ywkang 95d583ef9f Add Phase 1→Phase 2 e2e data tests + GraphEngine enable_data mode
GraphEngine(enable_data=True):
- Creates MemoryStore + OpLogger
- Injects op_logger into all components
- Exposes engine.op_log and engine.memory_store properties

E2E tests (test_e2e_data.py):
- Engine data mode creates store + logger
- Default engine has no store
- PeDmaMsg completes successfully with data mode
- DataExecutor GEMM accuracy: random f16 matmul with f32 accumulation
- DataExecutor chain: GEMM → exp correctness
- DataExecutor verify API: pass/fail per tensor
- MemoryStore snapshot isolation between Phase 1 and Phase 2

382 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 23:49:28 -07:00
ywkang f5d1606f9d Add ADR-0021 pipeline tests: self-routing, tiling, overlap
Test plan items 3-5:
- TileToken self-routing: advance(), stage sequence, chain traversal
- PipelineContext: completion tracking, exactly-once contract
- Tiling plans: GEMM tile count, stage sequence, intermediate K no DMA_WRITE
- Math plan: READ→FETCH→MATH→STORE→WRITE sequence
- Pipeline overlap: SimPy simulation verifying intra-command tile overlap

9 new tests, all passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 23:40:19 -07:00
ywkang b6eb97c49a Implement ADR-0021: PE pipeline refactor with token self-routing
Step 1-2: Backup existing code
- builtin/ → builtin_legacy/ (unchanged backup)
- custom/pe_accel/ → custom/pe_accel_legacy/ (unchanged backup)

Step 3-4: New pipeline types and tiling
- pe_types.py: StageType, Stage, TilePlan, PipelinePlan, PipelineContext, TileToken
- tiling.py: generate_gemm_plan, generate_math_plan (ported from pe_accel)

Step 5: Component implementations (ADR-0021 D4-D6)
- PE_SCHEDULER: _feed_loop (singleton FIFO feeder) + plan generation
- PE_FETCH_STORE: new component — TCM ↔ Register File
- PE_GEMM: TileToken pipeline + legacy PeInternalTxn dual-mode
- PE_MATH: TileToken pipeline + legacy dual-mode
- PE_DMA: TileToken pipeline + legacy + fabric Transaction triple-mode
- PE_TCM: TcmRequest handler with dual-channel BW serialization

Step 6: Infrastructure
- topology.yaml: pe_fetch_store component + chaining edges
- components.yaml: pe_fetch_store_v1 registration
- builder.py: PE_COMP_OFFSETS, _add_pe_internal_edges, PE view positions
- Tests: node/edge counts, PE component sets updated

All components handle both TileToken (pipeline) and PeInternalTxn (legacy).
Token self-routing: components read next stage from token.plan, chain via out_port.
366 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 23:35:31 -07:00
ywkang 161132cdcb ADR-0021: PE pipeline refactor — component separation + token self-routing
Design for refactoring pe_accel monolith into independent builtin components:
- D1: 6 independent components (scheduler, DMA, fetch_store, GEMM, MATH, TCM)
- D2: Token self-routing — scheduler only dispatches + tracks completion
- D3: done signal = simpy.Event (HW wire), data = message (queue)
- D4: Async pipeline with single FIFO feeder, command-level ordering
- D5: PE_FETCH_STORE separates TCM↔register from compute
- D6: Compute components implement _process() only, chaining in base
- D7: Topology adds pe_fetch_store + chaining edges
- D8: Existing builtin/pe_accel → builtin_legacy backup, new builtin
- D9: TileToken with plan + stage_idx for self-routing

Key decisions from review:
- No PipelineManager object — scheduler + existing ports sufficient
- PipelineContext with exactly-once completion contract
- _feed_loop singleton per scheduler, FIFO command ordering
- Intra-PE chaining: no explicit latency model
- Latency models ported from pe_accel current implementation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 23:21:40 -07:00
ywkang 51004c311c Implement ADR-0020: 2-pass data execution with greenlet kernel runner
Step 1 — Foundation:
- OpRecord/OpLogger: op log infrastructure with t_start stable ordering
- MemoryStore: numpy ndarray tensor-granular storage (reference semantics)
- data_op=True flag on DmaReadCmd, DmaWriteCmd, GemmCmd, MathCmd, CompositeCmd
- numpy/greenlet dependencies added to pyproject.toml

Step 2 — ComponentBase hooks:
- _on_process_start/end hooks in _forward_txn (fabric messages)
- _handle_with_hooks in PeEngineBase (PE-internal commands)
- op_logger optional — zero overhead when disabled

Step 3 — KernelRunner + greenlet:
- KernelRunner: greenlet ↔ SimPy bridge in triton_emu/kernel_runner.py
- TLContext: _emit() method routes to greenlet switch or command list
- tl.load() returns real numpy data in greenlet mode
- Dynamic control flow supported (memory-read based branching)

Step 4 — PE_CPU integration:
- Greenlet mode when ctx.memory_store is set, legacy fallback otherwise
- Refactored into _execute_greenlet/_execute_legacy/_send_response
- ComponentContext gains memory_store and op_logger fields

Step 5 — DataExecutor:
- Phase 2 numpy execution for GEMM/Math ops from op_log
- _compute_math: all unary/binary/reduction ops
- verify(): compare MemoryStore against expected with dtype tolerance

28 new tests, 366 total passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 00:22:44 -07:00
ywkang 140b85436a ADR-0020: 2-Pass data execution model with greenlet kernel runner
Design for actual data storage/computation in HBM/TCM/SRAM components:
- Phase 1: SimPy timing + MemoryStore (memory ops data-aware via greenlet)
- Phase 2: op_log-based numpy execution for GEMM/Math verification
- Greenlet-based KernelRunner replaces Phase 0 command list generation
- tl.load() returns real data in Phase 1, enabling memory-based control flow
- ComponentBase hook for op logging (single source of truth)
- MemoryStore: numpy ndarray tensor-granular storage with reference semantics

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 23:53:49 -07:00
ywkang eb792e6212 Remove xbar/noc remnants, rule-based cube-view connectors
- Delete xbar.py and noc.py (TwoDMeshNocComponent) — unused since router mesh
- Remove xbar_v1/noc_2d_mesh_v1 from components.yaml
- Fix pe_to_xbar → pe_to_router in routing exclusion set
- Fix xbar_to_hbm_bw_gbs → hbm_to_router_bw_gbs in report.py
- Update all docstrings/comments referencing xbar/bridge → router mesh
- Cube-view connectors: rule-based _connector_points helper
  - PE↔router: single diagonal line (not chevron)
  - UCIe N/S: 45°→horizontal→45°
  - UCIe E/W: 45°→vertical→45°
  - HBM ports: 45°→horizontal→45°

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 23:59:12 -07:00
ywkang 7640635f90 M_CPU/SRAM placement via pos_mm in topology.yaml (nearest router)
Component placement uses mm coordinates in topology.yaml, mesh_gen
finds the nearest router automatically. M_CPU moved to pos_mm=[7.5,2.0]
(→ r0c2), SRAM at pos_mm=[1.5,9.0] (→ r3c0).

No hardcoded router references in topology config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:48:20 -07:00
ywkang 3ea4fa90f8 Cube-view: increase 45° stub length and component gap for visibility
Stub length increased to 12px (PE/HBM) and 10px (UCIe).
Gap between router and component increased to 30px so both
45° stubs (router end + component end) are clearly visible.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:38:27 -07:00
ywkang 5125d92c17 Cube-view: M_CPU north, 45° stub-straight-stub connector pattern
- M_CPU placed north (above) its router
- All connectors: 45° stub from router → straight → 45° stub to component
- Consistent 4-point polyline pattern for PE, M_CPU, SRAM, HBM, UCIe

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:34:48 -07:00
ywkang 72acc5c8bb Cube-view: UCIe flush against cube edges
UCIe position calculated with minimal inset (0.3 × size) to
place components flush against cube boundary edges.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:28:58 -07:00
ywkang bde76ec959 Cube-view: 45° diagonal from router, then straight to component
All connectors now start with 45° diagonal from router edge,
then go straight (vertical/horizontal) to the component block.
Applies to PE, M_CPU/SRAM, PE→HBM, and UCIe connectors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:25:41 -07:00
ywkang d3de982ea4 Cube-view: 90° router mesh links, 45° component connectors
Router-router mesh links remain straight (horizontal/vertical).
All component→router connectors use 45° L-bend polylines:
- PE blocks: vertical then 45° diagonal to router
- M_CPU/SRAM: horizontal then 45° diagonal to router
- PE→HBM port group: vertical then 45° diagonal
- UCIe port→router: direction-aware 45° bend

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:20:28 -07:00
ywkang df81835d84 Cube-view: UCIe position/size from topology.yaml (ucie_mm.size=2.0)
UCIe components placed at defined positions from _cube_local_positions
with size from cube.geometry.ucie_mm.size. N/S horizontal, E/W vertical.
Connection ports rendered as color-coded boxes inside UCIe component.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:11:11 -07:00
ywkang 66ec6cd40c Cube-view: UCIe components inside cube boundary with port boxes
- UCIe-N/S/E/W drawn as component blocks inside cube boundary
  (inset 3mm from edge)
- Each UCIe has c0-c3 connection ports as color-coded boxes inside
- Connector lines from each port box to its attached router
- Removed old UCIe rendering that placed blocks outside cube

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 23:58:32 -07:00
ywkang e766163a25 Cube-view: HBM pseudo channel ports on edges, UCIe flush to cube border
- HBM pseudo channel ports split to top/bottom edges of HBM zone
  (32 ports each, 8 per PE, color-coded)
- PE→HBM lines connect router to its port group center
- Per-PE label: "PE0×8ch" with BW annotation
- UCIe blocks flush against cube edges at router positions
- UCIe blocks smaller (22×10px)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:38:10 -07:00
ywkang 24faf2e1d4 Cube-view: angle HBM lines, offset M_CPU/SRAM blocks
- HBM connection lines angled 30% toward HBM center (not vertical)
  to distinguish from mesh links
- M_CPU/SRAM blocks placed to the left of their router
  with horizontal connector lines (avoid mesh overlap)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:30:56 -07:00
ywkang 7cd30e106e Fix Router→HBM_CTRL lines visibility in cube_view
Draw HBM connection lines last (on top of component blocks).
PE routers: thicker (1.5px, opacity 0.6) with dashed style.
Relay routers: thinner (0.7px, opacity 0.2).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:25:40 -07:00
ywkang 109c9b4483 Cube-view: draw all attached components as separate blocks
All router-attached components (PE, M_CPU, SRAM, UCIe) rendered as
labeled blocks with explicit connector lines to their router.
UCIe blocks positioned at cube edges matching port direction.
Router→HBM_CTRL lines shown for all 32 routers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:09:08 -07:00
ywkang e94f1de078 Cube-view SVG: detailed topology validation rendering
- Dedicated cube_view renderer showing 6×6 router grid with attachments
- PE blocks drawn next to their router (above/below)
- HBM pseudo channel port bar (64 ports, color-coded by PE owner)
- Per-PE BW annotations on HBM links
- Router color-coded by type (PE/M_CPU/SRAM/UCIe/relay)
- Title shows mode, channel count, per-PE and total BW
- Legend for all component types

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:03:38 -07:00
ywkang 5c6abe6d12 Reduce SRAM/UCIe/M_CPU/HBM node sizes, thin HBM and mesh links
Shrink cube-view component nodes to avoid clutter.
HBM and router_mesh edge lines made thinner and more transparent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 21:51:41 -07:00
ywkang f298e3c7cc Offset PE nodes in cube_view to avoid overlapping routers
PE nodes are shifted 1.2mm above (top half) or below (bottom half)
their assigned router position. PE size reduced to 1.4x0.7mm.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 18:50:32 -07:00
ywkang 91085733ba Show individual routers in cube_view SVG, fix row Y overlap
- cube_view now renders all 32 router nodes from cube_mesh.yaml
  instead of collapsed "router_mesh" placeholder
- Fix mesh_gen row Y position overlap (r1/r2 and r3/r4 had same Y)
  by adding hbm_gap spacing between PE rows and HBM zone
- Add noc_router to visualizer KIND_SIZE for proper sizing
- Update cube view tests for individual router nodes

339 passed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 18:22:38 -07:00
ywkang d2c92b8a18 Wire PE_MMU to router mesh for MmuMapMsg delivery
Add router → PE_MMU edge so MmuMapMsg can reach PE_MMU via
the router mesh. Unskip all PE_MMU fabric tests.

339 passed, 0 skipped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 18:10:42 -07:00
ywkang 08256c1326 Fix cross-SIP PE_TCM access by scoping deploy to target_device SIP
RuntimeContext._ensure_allocators() now limits SIP range to
target_device (single SIP or all). Prevents cross-SIP tensor
deployment that caused PE_TCM routing errors.
Also accept 'sip0' format (without colon) in DeviceSelector.

331 passed, 8 skipped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 18:03:11 -07:00