Commit Graph

60 Commits

Author SHA1 Message Date
ywkang a76487ca48 PE_DMA perf: SIP-wide scenarios + dual outputs + clearer naming
User asked to surface system-wide congestion (more accurate than
single-cube), bring back the latency-breakdown plot under a separate
filename, and rename the obscure ``streaming`` category.

Scenarios:
  Renamed all_pe_to_pe0 → all_pe_cube0_to_pe0 (clarify cube scope).
  Added two SIP-wide scenarios:
    sip_local_all     — every PE in sip0 (128 total) accesses its own
                        local slice. All paths disjoint (each PE owns
                        its own hbm_ctrl.peX), so the model should
                        scale linearly with cube count.
    sip_hotspot_pe0   — every PE in sip0 (128 total) targets
                        sip0.cube0.pe0_slice. Worst-case hotspot:
                        UCIe inbound + r0c0→hbm_ctrl.pe0 saturated.
  Each bar now carries an ``N=...`` annotation showing the issuer
  count, and the chart titles say the scope explicitly.

Effective BW + util at 16 KB:
  sip_local_all       N=128  eff= 27.2 TB/s  util_a= 83 %
  sip_hotspot_pe0     N=128  eff= 134 GB/s   util_a= 93 %
                                              (UCIe-into-cube0 saturated)

Plots:
  no_congestion.png + congestion.png        — Effective BW utilization
                                              (two bars: single vs aggregate peak)
  breakdown_no_congestion.png +
  breakdown_congestion.png                  — stacked latency breakdown
                                              (renamed from previous)
  summary.csv with columns for both views.

The visual y-cap on BW utilization is 150 %. Bars exceeding it (e.g.
sip_local_all's util_single = 10,639 %) are drawn at the cap with an
upward arrow and the real value annotated. The verification rule for
``util_single`` is loosened to ``≤ n_issuers × 100 % + 5 %`` so
massively-parallel disjoint scenarios pass.

Category renamed: ``streaming`` → ``wire_transfer``. It is the
bulk-transfer time = (n_flits − 1) × flit_bytes / bottleneck_bw — the
cost of streaming the rest of the payload through the slowest wire
after the first flit has arrived.

All checks PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 09:43:09 -07:00
ywkang a143925a12 PE_DMA perf: dual-peak utilisation (single-path + aggregate)
Each scenario now shows TWO bars:

  util_single    = effective_bw / single-path peak × 100
                   (peak = min bw_gbs on first issuer's path)
  util_aggregate = effective_bw / aggregate-resource peak × 100
                   (peak = max-min fair share across concurrent paths)

Aggregate peak uses a max-min fair-share computation: each concurrent
path's sustainable share on an edge is bw_gbs / usage_count, the
per-path throughput is the min share along its edges, and the aggregate
peak is the sum across paths. This produces the correct answer for both
shared-bottleneck scenarios (N paths converge on one wire → aggregate =
wire BW) and multi-lane shared resources (UCIe's 4 connections used in
parallel → aggregate ≈ 4 × per-conn BW), without enumerating max-flow.

Single-issuer (no_congestion) → util_single == util_aggregate by
definition. Congestion exposes the divergence:
  ctrl_hot_{1,2,3}, all_pe_to_pe0 → both metrics agree (one shared
                    bottleneck: r0c0→hbm_ctrl.pe0 @ 256 GB/s)
  8×PE eastbound → util_single=106 % (single conn @ 128 GB/s) but
                    util_aggregate=85 % (UCIe-W.conn0 @ 7-way shared,
                    aggregate peak ≈ 160 GB/s under the current
                    cross-cube routing that funnels via cube1.r0c0).

Verification updated to assert:
  (2) util_aggregate ≤ 100 % (effective BW can't exceed the aggregate
      resource peak, by construction).
  (3) single-issuer util_single == util_aggregate.
  (7) ucie_eastbound: util_aggregate is meaningfully smaller than
      util_single (the multi-lane peak correction is observable).

CSV grows with peak_aggregate_bw_gbs and util_aggregate_pct columns;
breakdown columns retained.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 08:53:00 -07:00
ywkang 0bf220fed0 Switch PE_DMA perf plots to Effective BW utilization
Replaces the latency-breakdown stacked bars with a single utilization
bar per scenario. Each bar shows ``effective_bw / peak_bottleneck_bw``
with both values annotated, and a horizontal "single-path peak" line at
100 %. The colour band (green ≥70 %, amber ≥40 %, red <40 %) makes the
no-congestion distance roll-off scannable at a glance.

Definitions:
  effective_bw = (total bytes transferred) / wall-clock time
    no_congestion: nbytes / total_ns
    congestion:    n_issuers × nbytes / makespan_ns  (aggregate)
  peak_bw      = min(edge.bw_gbs) on first issuer's path
  util_pct     = effective_bw / peak_bw × 100

The congestion graph shows that 8×PE eastbound exceeds 100 % of a
single-path peak (106.4 %): UCIe-N's 4 connections × 128 GB/s give
512 GB/s of aggregate eastbound capacity, so concurrent issuers across
disjoint conns sum past any single conn's 128 GB/s. The 8×PE→pe0_slice
hotspot reaches 91.7 %, almost saturating the shared r0c0→hbm_ctrl.pe0
bottleneck — the simulator's address-based PC striping + per-flit
arbitration model amortises the cost cleanly.

Self-verification updated to BW invariants:
  (1) effective BW shrinks as topological distance grows
  (2) util_pct ∈ (0, 250 %]
  (3) single-issuer util_pct ≤ 100 %
  (4) effective_bw = nbytes / total_ns for single requests
  (5) congestion aggregate BW grows monotonically with issuer count
      on the hot-target series
  (6) 8-PE all-hit-pe0 saturates ≥ 70 % of shared peak

All checks PASS at the current model.

The CSV retains all breakdown components (pe_setup, noc_mesh, ucie,
fabric, streaming, hbm_ctrl, contention) so a future replot can still
recover the latency-breakdown view without re-running the simulator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 07:59:45 -07:00
ywkang a759d58007 Add PE_DMA latency-breakdown plots + self-verification harness
scripts/plot_pe_dma_perf.py runs the simulator across six
no-congestion scenarios (SAME_CUBE_PE_LOCAL / REMOTE_BEST /
REMOTE_WORST, REMOTE_CUBE_BEST / REMOTE_WORST, REMOTE_SIP) and
five congestion scenarios (1/2/3 PE hot-target, 8-PE corresp.
cube-to-cube, 8-PE all-hit-pe0). It categorises actual total /
makespan into pe_setup, noc_mesh, ucie, fabric, streaming,
hbm_ctrl, and a contention residual using a wormhole-pipelined
model (first-flit arrival + (n_flits-1)/bottleneck + final
chunk_time).

Outputs:
  docs/diagrams/pe_dma_perf/no_congestion.png — single-PE latency
    by topological distance. Visualises monotonic growth from
    SAME_CUBE_PE_LOCAL (77 ns) up to REMOTE_CUBE_PE_REMOTE_WORST
    (573 ns) and REMOTE_SIP (409 ns).
  docs/diagrams/pe_dma_perf/congestion.png — makespan as concurrent
    issuer count grows. ctrl_hot_{1,2,3}=82/158/230 ns; 8-PE
    eastbound UCIe = 963 ns; 8-PE all-hit-pe0 = 558 ns.
  docs/diagrams/pe_dma_perf/summary.csv — raw rows for re-plotting.

Built-in --verify harness asserts:
  (1) distance monotonicity for no-congestion;
  (2) same-cube paths contain zero UCIe budget;
  (3) remote-cube/SIP paths carry positive UCIe budget;
  (4) breakdown is internally consistent (formula ≤ actual);
  (5) streaming term matches (n_flits-1) × flit_bytes /
      bottleneck_bw within 5 % for the local scenario;
  (6) congestion makespan is monotonic in issuer count;
  (7) 8-PE hotspot strictly exceeds 3-PE hotspot.

Cross-SIP gets a looser 70 % contention slack because the path
crosses two non-flit-aware (pcie_ep) boundaries that force
store-and-forward re-streaming the simple formula does not
attribute. Single-cube scenarios stay under 25 % residual.

All checks PASS at the current model (post ADR-0019 D1/D4
per-PE HBM CTRL restoration).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 01:23:42 -07:00
ywkang b8213d43a9 ADR-0019 D1/D4: per-PE HBM CTRL partitioning
Restores per-PE HBM controller partitioning that was lost in
commit 5917b34 ("Replace xbar/bridge/single-NOC with explicit
router mesh"), which had over-consolidated the per-slice HBM CTRL
into a single cube-wide ``hbm_ctrl`` connected to every router —
the opposite of what ADR-0019 D1/D4 specifies.

Builder splits ``hbm_ctrl`` into 8 ``hbm_ctrl.pe{X}`` instances per
cube, each reachable ONLY through PE_X's attaching router via the
existing ``peX.hbm`` attach metadata from cube_mesh.yaml. Cube
aggregate BW now matches the spec (8 PEs × 8 PCs × 32 GB/s =
2048 GB/s) instead of collapsing to 256 GB/s.

AddressResolver decodes the target PE from the HBM PA's hbm_offset
(``offset // slice_size``) and returns ``hbm_ctrl.pe{X}``. PathRouter
uses the existing ``_adj_local`` adjacency for same-cube PE_DMA so
the cube's own UCIe port can no longer appear as a zero-distance
shortcut between routers — local PE_DMA now traverses the mesh,
restoring the ADR-0019 D4 worked example
``PE0.pe_dma → r0c0 → … → r1c4 → hbm_ctrl``.

Tests:
- New tests/test_per_pe_hbm_partition.py: 14 tests covering
  topology shape, per-PE router exclusivity, PA resolution,
  single-hop local path, cross-PE mesh traversal, and end-to-end
  latency monotonicity. Probe CLI now reports
  pe-local < pe-same-half < pe-cross-half (was uniform 141ns).
- Existing tests updated for new node ids and replaced two
  assertions that locked in the wrong consolidation:
  test_noc_mesh.test_hbm_connects_to_all_routers and
  test_topology_compile.test_hbm_ctrl_connects_all_routers are
  now per-PE exclusivity assertions; test_routing
  .test_all_pe_hbm_equidistant becomes
  test_cross_pe_hbm_distance_increases_with_mesh_hops.
- test_ipcq_buffer_kind_locations.test_hbm_pe_hop_charged_at_large_payload
  threshold recalibrated 4000→1500 ns: the prior figure reflected
  serialization on the over-consolidated single hbm_ctrl; per-PE
  partitioning removes that artificial contention so the gap
  shrinks to the genuine PE↔HBM-hop cost.

Full suite: 645 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 01:04:30 -07:00
ywkang aaa1cbfaf6 ADR-0033 D6: address-based PC selection at HBM CTRL
Replaces global round-robin with deterministic address-derived PC
striping:

    pc_shift = log2(burst_bytes)
    pc_mask  = num_pcs - 1
    pc       = (flit.address >> pc_shift) & pc_mask

Each Transaction carries base_address (HBM byte offset of the first
chunk); each Flit derives its own address as base + i*flit_bytes.
HBM CTRL routes flits to PCs via this formula, replacing the
arrival-order RR pointer. Also splits the is_last wait into an
asynchronous _finalize_txn process so the worker isn't blocked on
PC commit, exposing true PC parallelism for disjoint addresses.

phyaddr.py documents the canonical bit layout (bits [10:8] for the
default burst=256, num_pcs=8 case). ADR-0033 D6 records the
derivation and the workload scenarios where address-striping
matters (strided streams, offset-disjoint parallel transfers).

Adds tests/test_hbm_address_based_pc.py: canonical bit mapping,
strided 8-way load distribution, same-address PC-0 serialization,
PC-aligned 2KB pair collision, dynamic pc_shift from burst_bytes,
and power-of-2 attr validation. Integration tests inspect
_pc_avail ledger directly: at default config UCIe's 8 ns per-txn
overhead exactly matches chunk_time, masking PC contention at the
makespan level even though the ledger correctly distinguishes the
cases.

Full suite: 631 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 00:18:46 -07:00
ywkang a44f832be5 Regenerate latency plots/diagrams for post-Phase-2c model
Allreduce + pe2pe + ipcq + pe_view auto-regenerated by test sweeps
running against the new chunk-streaming wire timing (per-flit
wormhole) — absolute numbers shift upward to reflect bottleneck-link
transit charged once per flit (instead of the previous cut-through
subtraction at HBM CTRL).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 23:24:01 -07:00
ywkang a0cccc71e8 Add HW architecture overview (Korean)
Standalone summary of the modeled hardware hierarchy and components.
Cross-references ADR-0003, 0004, 0014, 0017, 0022.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 23:23:52 -07:00
ywkang 32b29a1e5c ADR-0003/0014: generalize "router mesh" to "NOC"
NOC topology is an implementation choice (mesh, ring, crossbar, etc.).
ADR-0017 covers the current 2D mesh choice; ADRs at the system-level
shouldn't bind to that specific implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 23:23:46 -07:00
ywkang c9bd5387ac ADR-0033 D6: reorder future work by workload impact
Cycle-accurate arbitration policies (priority/iSLIP) downgraded to
"academic / specific use cases" — FIFO inbox is approximately fair
for typical similar-rate workloads (GEMM, AllReduce, data parallel).
True impact appears only for QoS modeling or per-stream tail latency
analysis under saturation.

Higher-priority items pulled forward: address-based PC selection at
HBM CTRL (directly affects multi-PE concurrent HBM contention), bank
conflict modeling, HBM scheduler, finite buffer backpressure, op_log
chunk-streaming integration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 23:21:35 -07:00
ywkang 9beb140eaa ADR-0033 D6: clarify what multi-flow merging actually models
Earlier the future-work list mentioned "multi-flow fair sharing on a
single shared link" which was confusing — each wire has a single
source, so this isn't a real gap. The actual modeling story:

- Multi-stream merging at routers IS handled via per-in_port fan_in +
  shared inbox + FIFO worker forwarding. Flits from different
  upstream streams interleave at flit granularity naturally.
- What's NOT modeled: cycle-accurate arbitration policies (priority,
  iSLIP), address-based PC selection at HBM CTRL (round-robin is
  address-blind, so size-aligned concurrent transactions hit full
  PC contention even when real-HW address striping would diverge),
  sub-flit (32B) granularity, finite buffer backpressure, and bank
  conflict modeling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 23:18:19 -07:00
ywkang c6788788a4 ADR-0033 Phase 2c-3 finish: op_log test + ADR doc reflect chunk-streaming
- test_op_log_per_transaction_not_per_flit (renamed from
  ..._records...): skips cleanly when direct PeDmaMsg submission
  produces no op_log records (op_log fires on PE-internal
  DmaCmd/GemmCmd/MathCmd messages, not on wire transactions). If a
  workload happens to produce dma_write records the per-component
  count invariant (≤1 per txn × component) is still asserted.
- ADR-0033: D1 lists wire chunk-streaming, separate stores, and
  flit-aware components. D2/D3/D4 updated for new wire model.
  D6 future work notes op_log full integration with chunk-streaming.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 23:12:50 -07:00
ywkang 5fdb6f8797 Latency model: HBM PC striping + chunk-loop drain (ADR-0033)
Previous model double-counted slow-upstream paths (e.g., 64KB via UCIe
128 GB/s was ~2x pessimistic). HBM CTRL now distributes bursts across
8 pseudo-channels via global round-robin, with per-chunk commit timing
that pipelines correctly against the bottleneck link's data arrival.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 21:59:07 -07:00
mukesh f6d262e359 Honest measured pipeline efficiency: two timing fixes
Two related issues caused measured pipeline efficiency to look
worse than the simulator's actual behavior:

1. DMA timing recorded too early. The op-log start timestamp
   for a DMA op fired when the request entered the queue, and
   the DMA channel was released as soon as the request was
   issued. Back-to-back DMAs therefore appeared to grab the
   channel simultaneously, with per-op duration drifting
   upward as queue depth grew - an artifact, not real cost.

   Fix: defer the start timestamp until after the channel is
   acquired, and hold the channel through the full HBM
   round-trip until the response returns. Per-op duration is
   now constant and equal to the actual transfer interval;
   serialization is visible as queue wait, not as inflated
   service time.

2. Sweep timing window folded in pre-composite work. The PE
   timing window spanned every PE engine record, which
   included the upfront pinned-operand DMA issued before the
   composite GEMM begins. For large-K shapes that one-shot
   load can be nearly half of the window, conflating
   operand-staging cost with composite-pipeline behavior.

   Fix: add a second window scoped to the composite pipeline
   by filtering op_log records to those tagged with a
   tile-pipeline stage; the legacy operand-load path is
   untagged and naturally excluded. For 32x3072x32 load_ref
   the window drops from 1765ns to 992ns and measured eff
   lines up with the steady-state DMA-bound stage limit
   instead of being penalized for the one-time load.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 14:19:17 -07:00
mukesh 83ea97b05f Composite GEMM: K-loop accumulator residency, pinned operands, sweep + deck
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:00:41 -07:00
mukesh 5accd98171 Add deck builder + overview-with-ref diagram scripts
scripts/build_overview_slides.py renders a 5-slide PPTX
(kernbench2_overview.pptx) summarizing architecture, model
correctness, IPCQ, allreduce, and buffer-kind tier comparison.

scripts/emit_overview_with_external_ref.py renders log-y and
broken-y variants of the allreduce overview (overview_log.png,
overview_broken.png) including a 366 µs ext-sim reference marker
at 96 KB / PE.

Also includes cube_mesh_view.png rendered from the SVG.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:20:54 -07:00
mukesh a563169e89 Add tl.recv_no_consume diagnostic API for apples-to-apples pe2pe plot
The pe2pe overview compared IPCQ (tl.send + tl.recv) against raw DMA
(tl.load + tl.store), but DMA is one-sided — DST never reads — while
tl.recv pays a slot-read on DST. The comparison was unfair: IPCQ
looked slower partly because it does more work.

Adds tl.recv_no_consume() — a separate, diagnostic-only entry point
that blocks for slot arrival but skips the slot-read (and bank-hop)
charge on DST. Production tl.recv is unchanged (no `consume` kwarg
on the public API), so the diagnostic flag can never accidentally
leak into real workloads.

Updates test_pe_to_pe_latency to call tl.recv_no_consume so the
overview.png shows IPCQ no-consume vs raw DMA on equal footing.
Also fixes PLOT_DIR back to docs/diagrams/pe2pe_latency_plots/
(was lost in a merge). Adds scripts/replot_pe2pe.py for label-only
re-renders without re-measuring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:20:44 -07:00
mukesh 9c129d6131 ADR-0023 D9.7+: charge PE↔bank fabric hop for SRAM/HBM IPCQ slots
Cube SRAM and HBM live on the cube NoC behind router-attached links
(sram_to_router_bw_gbs=128, hbm_to_router_bw_gbs=256). Previously the
slot-IO model treated them as if they were per-PE local, so the
buffer_kind sweep showed TCM ≈ SRAM at 64 KB / PE.

pe_ipcq._handle_recv and pe_dma._handle_ipcq_inbound now charge a
PE→bank compute_drain_ns on top of the intrinsic slot-IO for SRAM/HBM.
TCM stays free of this hop. Adds an internal IpcqRecvCmd.consume field
that gates the recv-side hop+slot-IO charges (used by a follow-up
diagnostic API; default True keeps current behavior).

Post-fix at 64 KB / PE: TCM 12.0 µs < HBM 21.4 µs < SRAM 24.3 µs.
SRAM is slowest because its 128 GB/s bank link is the narrowest in
the system — narrower than HBM's 256 GB/s. The existing ordering test
is rewritten from tcm<sram<hbm to tcm<hbm<sram and a new
test_ipcq_buffer_kind_locations adds 3 invariants on the gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:20:28 -07:00
ywkang 533e699299 IPCQ-DMA co-design HW design doc + fix IPCQ slot BW model
Add hardware design document (docs/ipcq-dma-codesign-hw.md) covering
PE_IPCQ high-level architecture, simulator verification, proposed HW
implementation, and alternatives analysis. Include D2 block diagrams
for baseline and proposed PE architectures.

Fix IPCQ slot-memory bandwidth parameters to match topology.yaml:
SRAM 128→512 GB/s (intrinsic BW, NoC-bottlenecked at 128),
HBM 32→256 GB/s (was per-channel, now per-PE aggregate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 13:31:02 -07:00
mukesh ad5f01ab13 Merge origin/master: combine single-cube fast path + center-root reduce
Conflict resolution:
- intercube_allreduce.py: kept origin's `if single_cube:` early-exit
  (TP launches kernel on one cube/rank → skip intra-SIP mesh and go
  direct to inter-SIP exchange) AND replaced the multi-cube body with
  the local center-root + bidirectional reduce/broadcast (8-hop
  critical path on 4×4 vs 12 with corner root).
- tests/{allreduce,pe2pe}_latency_plots/: kept the local move to
  docs/diagrams/; dropped origin's stale content edits to the old
  paths (regenerable derived artifacts).
- docs/diagrams/pe2pe_latency_plots/summary.csv: kept local
  (post-Phase-2 + center-root values).

Origin contributions retained as-is:
- pyproject.toml: matplotlib >= 3.7 dep.
- runtime_api/distributed.py: derive effective cube_w/h from tensor
  shard placement so single-cube TP paths get cube_w=cube_h=1.
- kernel_args() now accepts optional cube_w/cube_h kwargs.

Verified post-merge:
- test_intercube_root_center.py: 2/2 (center-root multi-cube path).
- test_tp_layers.py + test_tp_mlp.py: 10/10 (single-cube TP path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 21:41:46 -07:00
mukesh 1c5752a9ec Intercube allreduce: center root + bidirectional reduce
Move the algorithmic root cube from the corner (cube_w-1,
cube_h-1) to the geometric center (cube_w//2, cube_h//2) and
have each phase converge bidirectionally so the intra-SIP
critical path drops from ~12 hops to ~8 hops on a 4×4 mesh
(left half W→E + right half E→W in row reduce; top half N→S +
bottom half S→N in col reduce; mirrored on broadcast).

Result on torus_2d 6 SIPs at 96 KB / PE on TCM:
  before (corner root)  : 22.0 µs
  after  (center root)  : 17.2 µs   (−22%)

Same shape on ring_1d (−7%) and mesh_2d_no_wrap (−12%); also
holds across SRAM and HBM (~−20% each).

Phase 1 test (test_intercube_root_center.py) asserts the
torus_2d 96 KB latency drops below 20.5 µs and that all 96
cubes still validate (correctness preserved).

Plot updates:
- overview.png: replace constant 10.6 µs theoretical line with
  user-supplied hand-derived curve (per-cube packet count =
  bytes_per_pe × 8 PEs ÷ 128 B; 1346 ns startup + 1.20 ns/pkt).
- All summary.csv numbers and per-topology PNGs regenerated.
- pe2pe_latency_plots and ipcq diagram emitter PNGs refreshed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 21:28:58 -07:00
mukesh 84a1325e5c ADR-0023 D9.7: IPCQ slot-memory latency model (TCM/SRAM/HBM)
Charge per-tier bandwidth + setup overhead at IPCQ slot WRITE
(receiver inbound DMA, in pe_dma._handle_ipcq_inbound) and slot
READ (recv consume, in pe_ipcq._handle_recv). Tier table
(common/ipcq_types.py):
  tcm  : 512 GB/s, 0 ns
  sram : 128 GB/s, 2 ns
  hbm  :  32 GB/s, 6 ns

Before this change, slot read/write was free regardless of
buffer_kind, making memory-tier choice invisible in simulated
latency. After the change, swapping buffer_kind in ccl.yaml
produces measurable per-tier separation in allreduce latency.

Tests:
  test_ipcq_buffer_kind_latency.py — three micro-tests asserting
    tcm < sram < hbm ordering, payload-scaling, and that
    buffer_kind sensitivity grows with payload (credit-only path
    stays fabric-bound).
  test_allreduce_buffer_kind_sweep.py — 12-config parametrized
    sweep emitting buffer_kind_sweep.png (3 lines, torus_2d).

conftest sessionfinish hook generalised to dispatch multiple
sweep aggregators (allreduce + buffer-kind).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 21:28:34 -07:00
mukesh 1e39214f89 Move generated diagrams to docs/diagrams/; add IPCQ diagram emitter
Plot output dirs now live under docs/diagrams/ (the canonical
"derived artifacts" location per CLAUDE.md):
  tests/allreduce_latency_plots/ → docs/diagrams/allreduce_latency_plots/
  tests/pe2pe_latency_plots/     → docs/diagrams/pe2pe_latency_plots/
  + new docs/diagrams/ipcq_diagram_plots/ with two presentation diagrams
    (ipcq_send_recv.png, ipcq_two_pe_dma.png)

New test tests/test_emit_ipcq_diagram.py renders the two IPCQ
diagrams from a static description (no simulation); it exists so
the diagrams can be regenerated reproducibly.

Path references updated in tests/test_pe_to_pe_latency.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 21:28:17 -07:00
mukesh 1c33afec55 ADR-0032 + intra_* opposite directions in IPCQ install
Add intra_N/S/E/W to install.py _OPPOSITE_DIR table so the intra-cube
PE-to-PE namespace is symmetrical with intercube N/S/E/W. ADR-0032
documents the intercube allreduce algorithm (supersedes ADR-0029).
Refresh ADR-0024/0025/0029 cross-refs and update
test_intercube_sfr_config.py to cover the new intra_* mappings. Drop
the obsolete test_ccl_round_robin_recv.py (replaced by intercube tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:43:01 -07:00
ywkang 81cc32c46b ADR-0001 Rev 2: 51-bit PhysAddr layout with concrete sub-unit tables
Remove rack_id (4 bits), rename sip_seg→die_id, shift fields to enable
42-bit local_offset (4 TB per die). Define PE_LOCAL/MCPU_LOCAL/CUBE_SRAM
sub-unit tables for AHBM dies and IOCPU sub-unit table for IOCHIPLET
dies (1 TB window). Supersedes ADR-0031.

Also fixes latent VA/PA confusion in pe_dma pipeline DMA path where
virtual addresses were decoded as physical addresses without MMU
translation — previously masked by coincidental bit-position alignment.

529 passed (+6 recovered), 10 pre-existing failures unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-27 15:52:29 -07:00
mukesh c1a5cf3a2a ADR-0009 D5: chain-aware target_start_ns + zero-byte launch fanout
The single-walk predictor (find_node_path(io_cpu, pe_cpu) +
compute_path_latency_ns) under-shot actual dispatch latency for far
cubes -- the routing graph could pick a path bypassing M_CPU, and
non-zero-nbytes launch sub-txns serialized on shared first hops.
Far PEs arrived at _execute_kernel after target_start_ns, silently
skipped the barrier yield, and started pe_exec_start late. Their
reported pe_exec_ns under-counted by exactly the late_ns amount
(63 ns observed at h4 cube4.pe0 in the IPCQ test, up to 113 ns
worst case for cubes 9-11), producing the suspicious flat region
in the h4 IPCQ curve at 8192/10240 bytes.

Fix:
  - IO_CPU predictor uses the explicit two-leg chain
    (IO_CPU->M_CPU + M_CPU->PE_CPU - io.overhead - m.overhead), so
    every PE on every targeted cube has a barrier >= its real
    dispatch arrival.
  - Kernel-launch fanout sub-txns carry nbytes=0 (control-plane,
    not data-plane), removing the per-cube fanout serialization
    that pushed far M_CPUs past the predictor.
  - Legacy io_cpu mirror updated.

ADR-0009 D5 mechanism updated to specify the two-leg formula and
the nbytes=0 requirement. New tests/test_d5_barrier_invariant.py
asserts (a) no PE enters _execute_kernel after target_start_ns and
(b) every PE in a multi-cube launch has identical pe_exec_start --
both regressions silently pass on the existing
tests/test_kernel_launch_sync.py because that test only inspects
post-aggregation max(pe_exec_ns).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:12:58 -07:00
mukesh 90874abbfe ADR-0023 D9: blocking credit-emit with full-path latency
PE_IPCQ._handle_recv now yields-from _delayed_credit_send instead of
spawning it as a fork, so the receiver's pe_exec_ns includes the
credit-return cost. _credit_latency_ns switches from
compute_drain_ns(path, 16) to compute_path_latency_ns(path, 16) and
fixes a latent find_path bug where the destination lacked the
".pe_dma" suffix (silently returned 0 ns under the bare except).

Net effect on h3/h4 inter-cube pe-to-pe latency: IPCQ >= raw DMA at
every size, matching real-HW posted-write semantics. tl.send remains
fire-and-forget. ADR-0023 D9 amended; new diagnostic test
tests/test_pe_to_pe_diagnostic.py captures per-PE pe_exec_ns, paths,
drain, and meta-arrival timing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:12:38 -07:00
mukesh 14d800b0ae Kernel-launch sync (ADR-0009 D5) and IPCQ drain at inbound (ADR-0023)
- KernelLaunchMsg gains target_start_ns: IO_CPU stamps a global barrier
  (max path latency across every target PE), M_CPU passes it through,
  PE_CPU yields until it before recording pe_exec_start. Every PE in a
  launch begins kernel execution at the same env.now regardless of its
  dispatch path length — eliminates per-PE dispatch-offset artifact in
  cross-PE and cross-cube latency measurements.

- PE_DMA._handle_ipcq_inbound now pays Transaction.drain_ns at the top,
  matching the terminal-drain behavior of ComponentBase._forward_txn for
  every non-IPCQ Transaction. SRC-side tl.send stays fire-and-forget
  (sender doesn't yield on sub_done); tl.recv now blocks until bytes
  have actually drained into its inbox.

- ComponentContext: new compute_path_latency_ns helper + node_overhead_ns
  field populated by GraphEngine.

- tests/test_kernel_launch_sync.py: asserts all PEs in one launch
  produce identical pe_exec_ns for a no-op kernel (zero spread).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 15:30:29 -07:00
ywkang e7f376ebaa ADR-0027 rev7 (Megatron TP + worker-wait generalization) + ADR-0026 typo fix
ADR-0027 is a design-only change (no production code). Rev 7 closes design
across 7 iterations of review. Key decisions:

- D0 (worker-wait generalization): ctx.wait in worker context yields to
  main scheduler, which drains env.run. Solves ADR-0024 Phase B orphan
  bug (ring_default_ws strict xfail). Normative contracts on resume
  invariant, fast-path, main-context non-reentrance, barrier
  loop-until-empty, and scheduler non-progress as user contract.
- D0.5 (host-read barrier): Tensor.numpy/data/__getitem__/__repr__/copy_
  auto-drain pending before reading. Closed-set via explicit registry
  (T5.g). copy_ uses global-pending barrier with explicit
  over-serialization tradeoff.
- D1 (torch.multiprocessing.spawn): real-PyTorch API-signature parity,
  cooperative greenlet scheduler internally. Explicit non-goal on
  process isolation / address space / failure isolation. Sibling
  cleanup via SystemExit + SpawnException(errors) wrapping root-cause
  ranks.
- D4/D5 (TP layers): ColumnParallelLinear / RowParallelLinear use
  torch.launch(gemm_kernel) — no host-side torch.matmul. Yield-safety
  contract normatively required for all TP forward paths.
- Supersedes ADR-0024 D7/D12/D13 as design (none landed). Source of
  truth declared normative.

Test strategy: T1-T8 with numerical-correctness primary (not mean/
aggregate-only), orphan invariant direct assertion, host-read barrier
closed-set via registry. Phase 2 acceptance = 524 passed + 0 xfail
(ring_default_ws unblocked by D0).

ADR-0026 typo fix: torch.cuda.set_device → torch.ahbm.set_device in
DPPolicy docstring (ADR-0024 D10 convention).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 14:13:26 -07:00
ywkang 357cab525b ADR-0026: DPPolicy intra-device only + ShardSpec structural coords
DPPolicy no longer carries a cross-SIP axis. SIP-level placement is
solely controlled by torch.ahbm.set_device(rank) (ADR-0024); DPPolicy
itself describes only the cube × PE layout within one SIP. ShardSpec
switches to structural (sip, cube, pe) coordinates; the flat pe_index
field/property is fully removed — silent drift between global-flat and
SIP-local interpretations was a foot-gun flagged by ADR-0024 D11.

Breaking API (explicit TypeError / AttributeError):
- DPPolicy(sip=...) / DPPolicy(num_sips=...) -> TypeError
- ShardSpec.pe_index -> AttributeError
- ShardSpec(pe_index=...) -> TypeError
- resolve_dp_policy now takes target_sip= (required), no num_sips.

Downstream migration:
- PE allocator dict keyed by (sip, cube, pe) tuples, in both
  _ensure_allocators and _free_tensor. deploy_tensor uses tuple lookup.
- _create_tensor passes target_sip=current_sip; post-hoc pe_index
  shifting removed entirely.
- launch._compute_local_shape drops the dp.sip branch.
- Internal resolvers (column_wise / row_wise / replicate / tiled_*)
  return _LocalPeShard (cube-local identifier) instead of ShardSpec —
  resolve_dp_policy lifts them to full structural coords.

Tests:
- New tests/test_adr0026_dppolicy_intra_device.py (12 tests) pins the
  contract end-to-end.
- test_sip_parallel.py rewritten: SIP composition now modeled as two
  resolve_dp_policy(target_sip=...) calls (ADR-0024 launcher style).
- Call-site migration: test_tensor, test_va_integration, test_va_offset,
  test_runtime_api_tensor, test_tl_recv_async, test_ccl_* and benches
  gemm_single_pe, gpt3_qkv, va_offset_verify, ccl_allreduce (legacy
  branch) all use intra-device DPPolicy and structural ShardSpec.

Result: 523 passed, 1 strict xfail (ring_default_ws — unchanged
ADR-0024 Phase B blocker; architectural fix deferred to ADR-0027).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 13:02:19 -07:00
ywkang e1084800ab docs: add ADRs 0024–0031 for SIP-TP launcher stack
ADR-0024 (SIP-level TP launcher): rank = SIP abstraction, engine-routed
  install, mp.spawn parity, epoch barrier, ShardSpec structural coords.
ADR-0025 (IPCQ direction addressing): address-based matching for meta
  arrival and credit return; fixes 2-rank bidirectional ring deadlock.
ADR-0026 (DPPolicy intra-device only): remove sip/num_sips fields;
  ShardSpec uses structural (sip, cube, pe); pe_index property removed.
ADR-0027 (Megatron-style TP API): ColumnParallelLinear / RowParallelLinear
  on top of ADR-0024 launcher. Backlog until 0024/0025/0026 land.
ADR-0028 (DTensor support): stub / future work.
ADR-0029 (Hierarchical all-reduce): 3-level reduce using all_pes mapper
  and multi_pe_sip_local validator from ADR-0024. Backlog.
ADR-0030 (IPCQ PhysAddr integration): blocked on ADR-0031.
ADR-0031 (PhysAddr PE-resource extension): stub; local_offset range-based
  partition approach; specific ranges TBD.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 00:38:27 -07:00
ywkang b2c52f0e34 Add English translations for ADR-0018, 0019, 0020, 0021
- ADR-0018: LA-based memory address abstraction + BAAW + HBM channel mapping
- ADR-0019: CUBE NOC per-channel and aggregated HBM connection model
- ADR-0020: 2-pass data execution model (timing/data separation, greenlet)
- ADR-0021: PE pipeline refactor (component separation + token self-routing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 16:31:32 -07:00
ywkang 998cc85762 Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)
Major changes:

PE-level IPCQ infrastructure:
- New PE_IPCQ component: ring-buffer control plane with 4-direction
  neighbor mapping, head/tail pointers, backpressure (poll/sleep).
- PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA,
  including in-flight data snapshot (D9) and op_log recording at
  outbound time for Phase 2 replay correctness.
- IpcqDmaToken piggyback model: data + metadata travel together,
  atomic visibility at receiver (invariant I6).
- Credit return fast path: bottleneck-BW latency, no fabric vc_comm.

Phase 2 data execution (ADR-0020 integration):
- op_log extended: DmaWriteCmd now captures src_space/src_addr for
  Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time.
- DataExecutor replays dma_write + ipcq_copy in t_start order.
- Engine._flush_data_phase: incremental cursor-based replay after
  each engine.wait() so host reads see post-Phase-2 data.
- KernelRunner Phase 1 writes disabled when op_log is active to
  prevent stale data from corrupting the MemoryStore snapshot.

TLContext / kernel API:
- tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype),
  tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode.
- TensorHandle operator overloading (add/sub/mul/div) via thread-local
  active TLContext → MathCmd dispatch through PE_MATH.
- PE-local scratch allocator for math output handles.
- tl.load returns space="hbm" handles for correct Phase 2 addressing.
- Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv.

Unified ccl_allreduce bench (PyTorch-compat host code):
- Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch)
  split matching real PyTorch DDP worker pattern.
- torch.distributed facade: init_process_group, get_world_size,
  get_rank, get_backend, all_reduce, barrier — only real PyTorch names.
- AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches
  kernel via tensor shard metadata (n_elem from shards[0].nbytes).
- world_size derived from topology spec (sips × cubes × pes_per_cube)
  with optional algorithm-level override in ccl.yaml.

Tensor API (PyTorch-compat surface):
- Tensor.numpy(): gather-aware (all shards via VA-based addressing).
- Tensor.copy_(source): scatter from host tensor into sharded target.
- RuntimeContext.from_numpy(arr): host-side staging tensor.
- Tensor.data property fixed to use numpy() (was shards[0]-only).

Algorithm modules moved to src/kernbench/ccl/algorithms/:
- ring_allreduce, mesh_allreduce, tree_allreduce, hello_send.
- Each module exports kernel_args(world_size, n_elem) helper.
- ccl.yaml module paths updated to kernbench.ccl.algorithms.*.

Dead code removed:
- 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.).
- _run_ccl_bench greenlet-per-SIP scheduler.
- benches.loader.is_ccl_bench + run_rank detection.
- benches/ccl/ directory.

Tests:
- New test_ccl_allreduce_matrix.py: 7 parametrized cases
  (ring×3 buffers, ring 8/16, mesh 4, tree 7).
- New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests.
- Existing tests updated for new import paths + world_size_override.

Docs:
- Korean ccl-author-guide.md and ADR-0023 paths updated.
- New English versions: ccl-author-guide.en.md, ADR-0023.en.md.

502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 19:36:59 -07:00
ywkang ff2c677a9c Add 2D grid program_id semantics (ADR-0022)
tl.program_id(axis=0) returns local PE id within cube,
tl.program_id(axis=1) returns cube id. Enables cube-aware
sharding in benchmark kernels.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 16:49:56 -07:00
ywkang 161132cdcb ADR-0021: PE pipeline refactor — component separation + token self-routing
Design for refactoring pe_accel monolith into independent builtin components:
- D1: 6 independent components (scheduler, DMA, fetch_store, GEMM, MATH, TCM)
- D2: Token self-routing — scheduler only dispatches + tracks completion
- D3: done signal = simpy.Event (HW wire), data = message (queue)
- D4: Async pipeline with single FIFO feeder, command-level ordering
- D5: PE_FETCH_STORE separates TCM↔register from compute
- D6: Compute components implement _process() only, chaining in base
- D7: Topology adds pe_fetch_store + chaining edges
- D8: Existing builtin/pe_accel → builtin_legacy backup, new builtin
- D9: TileToken with plan + stage_idx for self-routing

Key decisions from review:
- No PipelineManager object — scheduler + existing ports sufficient
- PipelineContext with exactly-once completion contract
- _feed_loop singleton per scheduler, FIFO command ordering
- Intra-PE chaining: no explicit latency model
- Latency models ported from pe_accel current implementation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 23:21:40 -07:00
ywkang 140b85436a ADR-0020: 2-Pass data execution model with greenlet kernel runner
Design for actual data storage/computation in HBM/TCM/SRAM components:
- Phase 1: SimPy timing + MemoryStore (memory ops data-aware via greenlet)
- Phase 2: op_log-based numpy execution for GEMM/Math verification
- Greenlet-based KernelRunner replaces Phase 0 command list generation
- tl.load() returns real data in Phase 1, enabling memory-based control flow
- ComponentBase hook for op logging (single source of truth)
- MemoryStore: numpy ndarray tensor-granular storage with reference semantics

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 23:53:49 -07:00
ywkang eb792e6212 Remove xbar/noc remnants, rule-based cube-view connectors
- Delete xbar.py and noc.py (TwoDMeshNocComponent) — unused since router mesh
- Remove xbar_v1/noc_2d_mesh_v1 from components.yaml
- Fix pe_to_xbar → pe_to_router in routing exclusion set
- Fix xbar_to_hbm_bw_gbs → hbm_to_router_bw_gbs in report.py
- Update all docstrings/comments referencing xbar/bridge → router mesh
- Cube-view connectors: rule-based _connector_points helper
  - PE↔router: single diagonal line (not chevron)
  - UCIe N/S: 45°→horizontal→45°
  - UCIe E/W: 45°→vertical→45°
  - HBM ports: 45°→horizontal→45°

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 23:59:12 -07:00
ywkang 7640635f90 M_CPU/SRAM placement via pos_mm in topology.yaml (nearest router)
Component placement uses mm coordinates in topology.yaml, mesh_gen
finds the nearest router automatically. M_CPU moved to pos_mm=[7.5,2.0]
(→ r0c2), SRAM at pos_mm=[1.5,9.0] (→ r3c0).

No hardcoded router references in topology config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:48:20 -07:00
ywkang 3ea4fa90f8 Cube-view: increase 45° stub length and component gap for visibility
Stub length increased to 12px (PE/HBM) and 10px (UCIe).
Gap between router and component increased to 30px so both
45° stubs (router end + component end) are clearly visible.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:38:27 -07:00
ywkang 5125d92c17 Cube-view: M_CPU north, 45° stub-straight-stub connector pattern
- M_CPU placed north (above) its router
- All connectors: 45° stub from router → straight → 45° stub to component
- Consistent 4-point polyline pattern for PE, M_CPU, SRAM, HBM, UCIe

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:34:48 -07:00
ywkang 72acc5c8bb Cube-view: UCIe flush against cube edges
UCIe position calculated with minimal inset (0.3 × size) to
place components flush against cube boundary edges.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:28:58 -07:00
ywkang bde76ec959 Cube-view: 45° diagonal from router, then straight to component
All connectors now start with 45° diagonal from router edge,
then go straight (vertical/horizontal) to the component block.
Applies to PE, M_CPU/SRAM, PE→HBM, and UCIe connectors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:25:41 -07:00
ywkang d3de982ea4 Cube-view: 90° router mesh links, 45° component connectors
Router-router mesh links remain straight (horizontal/vertical).
All component→router connectors use 45° L-bend polylines:
- PE blocks: vertical then 45° diagonal to router
- M_CPU/SRAM: horizontal then 45° diagonal to router
- PE→HBM port group: vertical then 45° diagonal
- UCIe port→router: direction-aware 45° bend

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:20:28 -07:00
ywkang df81835d84 Cube-view: UCIe position/size from topology.yaml (ucie_mm.size=2.0)
UCIe components placed at defined positions from _cube_local_positions
with size from cube.geometry.ucie_mm.size. N/S horizontal, E/W vertical.
Connection ports rendered as color-coded boxes inside UCIe component.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:11:11 -07:00
ywkang 66ec6cd40c Cube-view: UCIe components inside cube boundary with port boxes
- UCIe-N/S/E/W drawn as component blocks inside cube boundary
  (inset 3mm from edge)
- Each UCIe has c0-c3 connection ports as color-coded boxes inside
- Connector lines from each port box to its attached router
- Removed old UCIe rendering that placed blocks outside cube

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 23:58:32 -07:00
ywkang e766163a25 Cube-view: HBM pseudo channel ports on edges, UCIe flush to cube border
- HBM pseudo channel ports split to top/bottom edges of HBM zone
  (32 ports each, 8 per PE, color-coded)
- PE→HBM lines connect router to its port group center
- Per-PE label: "PE0×8ch" with BW annotation
- UCIe blocks flush against cube edges at router positions
- UCIe blocks smaller (22×10px)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:38:10 -07:00
ywkang 24faf2e1d4 Cube-view: angle HBM lines, offset M_CPU/SRAM blocks
- HBM connection lines angled 30% toward HBM center (not vertical)
  to distinguish from mesh links
- M_CPU/SRAM blocks placed to the left of their router
  with horizontal connector lines (avoid mesh overlap)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:30:56 -07:00
ywkang 7cd30e106e Fix Router→HBM_CTRL lines visibility in cube_view
Draw HBM connection lines last (on top of component blocks).
PE routers: thicker (1.5px, opacity 0.6) with dashed style.
Relay routers: thinner (0.7px, opacity 0.2).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:25:40 -07:00
ywkang 109c9b4483 Cube-view: draw all attached components as separate blocks
All router-attached components (PE, M_CPU, SRAM, UCIe) rendered as
labeled blocks with explicit connector lines to their router.
UCIe blocks positioned at cube edges matching port direction.
Router→HBM_CTRL lines shown for all 32 routers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:09:08 -07:00
ywkang e94f1de078 Cube-view SVG: detailed topology validation rendering
- Dedicated cube_view renderer showing 6×6 router grid with attachments
- PE blocks drawn next to their router (above/below)
- HBM pseudo channel port bar (64 ports, color-coded by PE owner)
- Per-PE BW annotations on HBM links
- Router color-coded by type (PE/M_CPU/SRAM/UCIe/relay)
- Title shows mode, channel count, per-PE and total BW
- Legend for all component types

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:03:38 -07:00