Commit Graph

35 Commits

Author SHA1 Message Date
ywkang a143925a12 PE_DMA perf: dual-peak utilisation (single-path + aggregate)
Each scenario now shows TWO bars:

  util_single    = effective_bw / single-path peak × 100
                   (peak = min bw_gbs on first issuer's path)
  util_aggregate = effective_bw / aggregate-resource peak × 100
                   (peak = max-min fair share across concurrent paths)

Aggregate peak uses a max-min fair-share computation: each concurrent
path's sustainable share on an edge is bw_gbs / usage_count, the
per-path throughput is the min share along its edges, and the aggregate
peak is the sum across paths. This produces the correct answer for both
shared-bottleneck scenarios (N paths converge on one wire → aggregate =
wire BW) and multi-lane shared resources (UCIe's 4 connections used in
parallel → aggregate ≈ 4 × per-conn BW), without enumerating max-flow.

Single-issuer (no_congestion) → util_single == util_aggregate by
definition. Congestion exposes the divergence:
  ctrl_hot_{1,2,3}, all_pe_to_pe0 → both metrics agree (one shared
                    bottleneck: r0c0→hbm_ctrl.pe0 @ 256 GB/s)
  8×PE eastbound → util_single=106 % (single conn @ 128 GB/s) but
                    util_aggregate=85 % (UCIe-W.conn0 @ 7-way shared,
                    aggregate peak ≈ 160 GB/s under the current
                    cross-cube routing that funnels via cube1.r0c0).

Verification updated to assert:
  (2) util_aggregate ≤ 100 % (effective BW can't exceed the aggregate
      resource peak, by construction).
  (3) single-issuer util_single == util_aggregate.
  (7) ucie_eastbound: util_aggregate is meaningfully smaller than
      util_single (the multi-lane peak correction is observable).

CSV grows with peak_aggregate_bw_gbs and util_aggregate_pct columns;
breakdown columns retained.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 08:53:00 -07:00
ywkang 0bf220fed0 Switch PE_DMA perf plots to Effective BW utilization
Replaces the latency-breakdown stacked bars with a single utilization
bar per scenario. Each bar shows ``effective_bw / peak_bottleneck_bw``
with both values annotated, and a horizontal "single-path peak" line at
100 %. The colour band (green ≥70 %, amber ≥40 %, red <40 %) makes the
no-congestion distance roll-off scannable at a glance.

Definitions:
  effective_bw = (total bytes transferred) / wall-clock time
    no_congestion: nbytes / total_ns
    congestion:    n_issuers × nbytes / makespan_ns  (aggregate)
  peak_bw      = min(edge.bw_gbs) on first issuer's path
  util_pct     = effective_bw / peak_bw × 100

The congestion graph shows that 8×PE eastbound exceeds 100 % of a
single-path peak (106.4 %): UCIe-N's 4 connections × 128 GB/s give
512 GB/s of aggregate eastbound capacity, so concurrent issuers across
disjoint conns sum past any single conn's 128 GB/s. The 8×PE→pe0_slice
hotspot reaches 91.7 %, almost saturating the shared r0c0→hbm_ctrl.pe0
bottleneck — the simulator's address-based PC striping + per-flit
arbitration model amortises the cost cleanly.

Self-verification updated to BW invariants:
  (1) effective BW shrinks as topological distance grows
  (2) util_pct ∈ (0, 250 %]
  (3) single-issuer util_pct ≤ 100 %
  (4) effective_bw = nbytes / total_ns for single requests
  (5) congestion aggregate BW grows monotonically with issuer count
      on the hot-target series
  (6) 8-PE all-hit-pe0 saturates ≥ 70 % of shared peak

All checks PASS at the current model.

The CSV retains all breakdown components (pe_setup, noc_mesh, ucie,
fabric, streaming, hbm_ctrl, contention) so a future replot can still
recover the latency-breakdown view without re-running the simulator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 07:59:45 -07:00
ywkang a759d58007 Add PE_DMA latency-breakdown plots + self-verification harness
scripts/plot_pe_dma_perf.py runs the simulator across six
no-congestion scenarios (SAME_CUBE_PE_LOCAL / REMOTE_BEST /
REMOTE_WORST, REMOTE_CUBE_BEST / REMOTE_WORST, REMOTE_SIP) and
five congestion scenarios (1/2/3 PE hot-target, 8-PE corresp.
cube-to-cube, 8-PE all-hit-pe0). It categorises actual total /
makespan into pe_setup, noc_mesh, ucie, fabric, streaming,
hbm_ctrl, and a contention residual using a wormhole-pipelined
model (first-flit arrival + (n_flits-1)/bottleneck + final
chunk_time).

Outputs:
  docs/diagrams/pe_dma_perf/no_congestion.png — single-PE latency
    by topological distance. Visualises monotonic growth from
    SAME_CUBE_PE_LOCAL (77 ns) up to REMOTE_CUBE_PE_REMOTE_WORST
    (573 ns) and REMOTE_SIP (409 ns).
  docs/diagrams/pe_dma_perf/congestion.png — makespan as concurrent
    issuer count grows. ctrl_hot_{1,2,3}=82/158/230 ns; 8-PE
    eastbound UCIe = 963 ns; 8-PE all-hit-pe0 = 558 ns.
  docs/diagrams/pe_dma_perf/summary.csv — raw rows for re-plotting.

Built-in --verify harness asserts:
  (1) distance monotonicity for no-congestion;
  (2) same-cube paths contain zero UCIe budget;
  (3) remote-cube/SIP paths carry positive UCIe budget;
  (4) breakdown is internally consistent (formula ≤ actual);
  (5) streaming term matches (n_flits-1) × flit_bytes /
      bottleneck_bw within 5 % for the local scenario;
  (6) congestion makespan is monotonic in issuer count;
  (7) 8-PE hotspot strictly exceeds 3-PE hotspot.

Cross-SIP gets a looser 70 % contention slack because the path
crosses two non-flit-aware (pcie_ep) boundaries that force
store-and-forward re-streaming the simple formula does not
attribute. Single-cube scenarios stay under 25 % residual.

All checks PASS at the current model (post ADR-0019 D1/D4
per-PE HBM CTRL restoration).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 01:23:42 -07:00
ywkang b8213d43a9 ADR-0019 D1/D4: per-PE HBM CTRL partitioning
Restores per-PE HBM controller partitioning that was lost in
commit 5917b34 ("Replace xbar/bridge/single-NOC with explicit
router mesh"), which had over-consolidated the per-slice HBM CTRL
into a single cube-wide ``hbm_ctrl`` connected to every router —
the opposite of what ADR-0019 D1/D4 specifies.

Builder splits ``hbm_ctrl`` into 8 ``hbm_ctrl.pe{X}`` instances per
cube, each reachable ONLY through PE_X's attaching router via the
existing ``peX.hbm`` attach metadata from cube_mesh.yaml. Cube
aggregate BW now matches the spec (8 PEs × 8 PCs × 32 GB/s =
2048 GB/s) instead of collapsing to 256 GB/s.

AddressResolver decodes the target PE from the HBM PA's hbm_offset
(``offset // slice_size``) and returns ``hbm_ctrl.pe{X}``. PathRouter
uses the existing ``_adj_local`` adjacency for same-cube PE_DMA so
the cube's own UCIe port can no longer appear as a zero-distance
shortcut between routers — local PE_DMA now traverses the mesh,
restoring the ADR-0019 D4 worked example
``PE0.pe_dma → r0c0 → … → r1c4 → hbm_ctrl``.

Tests:
- New tests/test_per_pe_hbm_partition.py: 14 tests covering
  topology shape, per-PE router exclusivity, PA resolution,
  single-hop local path, cross-PE mesh traversal, and end-to-end
  latency monotonicity. Probe CLI now reports
  pe-local < pe-same-half < pe-cross-half (was uniform 141ns).
- Existing tests updated for new node ids and replaced two
  assertions that locked in the wrong consolidation:
  test_noc_mesh.test_hbm_connects_to_all_routers and
  test_topology_compile.test_hbm_ctrl_connects_all_routers are
  now per-PE exclusivity assertions; test_routing
  .test_all_pe_hbm_equidistant becomes
  test_cross_pe_hbm_distance_increases_with_mesh_hops.
- test_ipcq_buffer_kind_locations.test_hbm_pe_hop_charged_at_large_payload
  threshold recalibrated 4000→1500 ns: the prior figure reflected
  serialization on the over-consolidated single hbm_ctrl; per-PE
  partitioning removes that artificial contention so the gap
  shrinks to the genuine PE↔HBM-hop cost.

Full suite: 645 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 01:04:30 -07:00
ywkang a44f832be5 Regenerate latency plots/diagrams for post-Phase-2c model
Allreduce + pe2pe + ipcq + pe_view auto-regenerated by test sweeps
running against the new chunk-streaming wire timing (per-flit
wormhole) — absolute numbers shift upward to reflect bottleneck-link
transit charged once per flit (instead of the previous cut-through
subtraction at HBM CTRL).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 23:24:01 -07:00
mukesh f6d262e359 Honest measured pipeline efficiency: two timing fixes
Two related issues caused measured pipeline efficiency to look
worse than the simulator's actual behavior:

1. DMA timing recorded too early. The op-log start timestamp
   for a DMA op fired when the request entered the queue, and
   the DMA channel was released as soon as the request was
   issued. Back-to-back DMAs therefore appeared to grab the
   channel simultaneously, with per-op duration drifting
   upward as queue depth grew - an artifact, not real cost.

   Fix: defer the start timestamp until after the channel is
   acquired, and hold the channel through the full HBM
   round-trip until the response returns. Per-op duration is
   now constant and equal to the actual transfer interval;
   serialization is visible as queue wait, not as inflated
   service time.

2. Sweep timing window folded in pre-composite work. The PE
   timing window spanned every PE engine record, which
   included the upfront pinned-operand DMA issued before the
   composite GEMM begins. For large-K shapes that one-shot
   load can be nearly half of the window, conflating
   operand-staging cost with composite-pipeline behavior.

   Fix: add a second window scoped to the composite pipeline
   by filtering op_log records to those tagged with a
   tile-pipeline stage; the legacy operand-load path is
   untagged and naturally excluded. For 32x3072x32 load_ref
   the window drops from 1765ns to 992ns and measured eff
   lines up with the steady-state DMA-bound stage limit
   instead of being penalized for the one-time load.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 14:19:17 -07:00
mukesh 83ea97b05f Composite GEMM: K-loop accumulator residency, pinned operands, sweep + deck
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:00:41 -07:00
mukesh 5accd98171 Add deck builder + overview-with-ref diagram scripts
scripts/build_overview_slides.py renders a 5-slide PPTX
(kernbench2_overview.pptx) summarizing architecture, model
correctness, IPCQ, allreduce, and buffer-kind tier comparison.

scripts/emit_overview_with_external_ref.py renders log-y and
broken-y variants of the allreduce overview (overview_log.png,
overview_broken.png) including a 366 µs ext-sim reference marker
at 96 KB / PE.

Also includes cube_mesh_view.png rendered from the SVG.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:20:54 -07:00
mukesh a563169e89 Add tl.recv_no_consume diagnostic API for apples-to-apples pe2pe plot
The pe2pe overview compared IPCQ (tl.send + tl.recv) against raw DMA
(tl.load + tl.store), but DMA is one-sided — DST never reads — while
tl.recv pays a slot-read on DST. The comparison was unfair: IPCQ
looked slower partly because it does more work.

Adds tl.recv_no_consume() — a separate, diagnostic-only entry point
that blocks for slot arrival but skips the slot-read (and bank-hop)
charge on DST. Production tl.recv is unchanged (no `consume` kwarg
on the public API), so the diagnostic flag can never accidentally
leak into real workloads.

Updates test_pe_to_pe_latency to call tl.recv_no_consume so the
overview.png shows IPCQ no-consume vs raw DMA on equal footing.
Also fixes PLOT_DIR back to docs/diagrams/pe2pe_latency_plots/
(was lost in a merge). Adds scripts/replot_pe2pe.py for label-only
re-renders without re-measuring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:20:44 -07:00
mukesh 9c129d6131 ADR-0023 D9.7+: charge PE↔bank fabric hop for SRAM/HBM IPCQ slots
Cube SRAM and HBM live on the cube NoC behind router-attached links
(sram_to_router_bw_gbs=128, hbm_to_router_bw_gbs=256). Previously the
slot-IO model treated them as if they were per-PE local, so the
buffer_kind sweep showed TCM ≈ SRAM at 64 KB / PE.

pe_ipcq._handle_recv and pe_dma._handle_ipcq_inbound now charge a
PE→bank compute_drain_ns on top of the intrinsic slot-IO for SRAM/HBM.
TCM stays free of this hop. Adds an internal IpcqRecvCmd.consume field
that gates the recv-side hop+slot-IO charges (used by a follow-up
diagnostic API; default True keeps current behavior).

Post-fix at 64 KB / PE: TCM 12.0 µs < HBM 21.4 µs < SRAM 24.3 µs.
SRAM is slowest because its 128 GB/s bank link is the narrowest in
the system — narrower than HBM's 256 GB/s. The existing ordering test
is rewritten from tcm<sram<hbm to tcm<hbm<sram and a new
test_ipcq_buffer_kind_locations adds 3 invariants on the gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:20:28 -07:00
ywkang 533e699299 IPCQ-DMA co-design HW design doc + fix IPCQ slot BW model
Add hardware design document (docs/ipcq-dma-codesign-hw.md) covering
PE_IPCQ high-level architecture, simulator verification, proposed HW
implementation, and alternatives analysis. Include D2 block diagrams
for baseline and proposed PE architectures.

Fix IPCQ slot-memory bandwidth parameters to match topology.yaml:
SRAM 128→512 GB/s (intrinsic BW, NoC-bottlenecked at 128),
HBM 32→256 GB/s (was per-channel, now per-PE aggregate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 13:31:02 -07:00
mukesh ad5f01ab13 Merge origin/master: combine single-cube fast path + center-root reduce
Conflict resolution:
- intercube_allreduce.py: kept origin's `if single_cube:` early-exit
  (TP launches kernel on one cube/rank → skip intra-SIP mesh and go
  direct to inter-SIP exchange) AND replaced the multi-cube body with
  the local center-root + bidirectional reduce/broadcast (8-hop
  critical path on 4×4 vs 12 with corner root).
- tests/{allreduce,pe2pe}_latency_plots/: kept the local move to
  docs/diagrams/; dropped origin's stale content edits to the old
  paths (regenerable derived artifacts).
- docs/diagrams/pe2pe_latency_plots/summary.csv: kept local
  (post-Phase-2 + center-root values).

Origin contributions retained as-is:
- pyproject.toml: matplotlib >= 3.7 dep.
- runtime_api/distributed.py: derive effective cube_w/h from tensor
  shard placement so single-cube TP paths get cube_w=cube_h=1.
- kernel_args() now accepts optional cube_w/cube_h kwargs.

Verified post-merge:
- test_intercube_root_center.py: 2/2 (center-root multi-cube path).
- test_tp_layers.py + test_tp_mlp.py: 10/10 (single-cube TP path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 21:41:46 -07:00
mukesh 1c5752a9ec Intercube allreduce: center root + bidirectional reduce
Move the algorithmic root cube from the corner (cube_w-1,
cube_h-1) to the geometric center (cube_w//2, cube_h//2) and
have each phase converge bidirectionally so the intra-SIP
critical path drops from ~12 hops to ~8 hops on a 4×4 mesh
(left half W→E + right half E→W in row reduce; top half N→S +
bottom half S→N in col reduce; mirrored on broadcast).

Result on torus_2d 6 SIPs at 96 KB / PE on TCM:
  before (corner root)  : 22.0 µs
  after  (center root)  : 17.2 µs   (−22%)

Same shape on ring_1d (−7%) and mesh_2d_no_wrap (−12%); also
holds across SRAM and HBM (~−20% each).

Phase 1 test (test_intercube_root_center.py) asserts the
torus_2d 96 KB latency drops below 20.5 µs and that all 96
cubes still validate (correctness preserved).

Plot updates:
- overview.png: replace constant 10.6 µs theoretical line with
  user-supplied hand-derived curve (per-cube packet count =
  bytes_per_pe × 8 PEs ÷ 128 B; 1346 ns startup + 1.20 ns/pkt).
- All summary.csv numbers and per-topology PNGs regenerated.
- pe2pe_latency_plots and ipcq diagram emitter PNGs refreshed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 21:28:58 -07:00
mukesh 84a1325e5c ADR-0023 D9.7: IPCQ slot-memory latency model (TCM/SRAM/HBM)
Charge per-tier bandwidth + setup overhead at IPCQ slot WRITE
(receiver inbound DMA, in pe_dma._handle_ipcq_inbound) and slot
READ (recv consume, in pe_ipcq._handle_recv). Tier table
(common/ipcq_types.py):
  tcm  : 512 GB/s, 0 ns
  sram : 128 GB/s, 2 ns
  hbm  :  32 GB/s, 6 ns

Before this change, slot read/write was free regardless of
buffer_kind, making memory-tier choice invisible in simulated
latency. After the change, swapping buffer_kind in ccl.yaml
produces measurable per-tier separation in allreduce latency.

Tests:
  test_ipcq_buffer_kind_latency.py — three micro-tests asserting
    tcm < sram < hbm ordering, payload-scaling, and that
    buffer_kind sensitivity grows with payload (credit-only path
    stays fabric-bound).
  test_allreduce_buffer_kind_sweep.py — 12-config parametrized
    sweep emitting buffer_kind_sweep.png (3 lines, torus_2d).

conftest sessionfinish hook generalised to dispatch multiple
sweep aggregators (allreduce + buffer-kind).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 21:28:34 -07:00
mukesh 1e39214f89 Move generated diagrams to docs/diagrams/; add IPCQ diagram emitter
Plot output dirs now live under docs/diagrams/ (the canonical
"derived artifacts" location per CLAUDE.md):
  tests/allreduce_latency_plots/ → docs/diagrams/allreduce_latency_plots/
  tests/pe2pe_latency_plots/     → docs/diagrams/pe2pe_latency_plots/
  + new docs/diagrams/ipcq_diagram_plots/ with two presentation diagrams
    (ipcq_send_recv.png, ipcq_two_pe_dma.png)

New test tests/test_emit_ipcq_diagram.py renders the two IPCQ
diagrams from a static description (no simulation); it exists so
the diagrams can be regenerated reproducibly.

Path references updated in tests/test_pe_to_pe_latency.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 21:28:17 -07:00
ywkang eb792e6212 Remove xbar/noc remnants, rule-based cube-view connectors
- Delete xbar.py and noc.py (TwoDMeshNocComponent) — unused since router mesh
- Remove xbar_v1/noc_2d_mesh_v1 from components.yaml
- Fix pe_to_xbar → pe_to_router in routing exclusion set
- Fix xbar_to_hbm_bw_gbs → hbm_to_router_bw_gbs in report.py
- Update all docstrings/comments referencing xbar/bridge → router mesh
- Cube-view connectors: rule-based _connector_points helper
  - PE↔router: single diagonal line (not chevron)
  - UCIe N/S: 45°→horizontal→45°
  - UCIe E/W: 45°→vertical→45°
  - HBM ports: 45°→horizontal→45°

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 23:59:12 -07:00
ywkang 7640635f90 M_CPU/SRAM placement via pos_mm in topology.yaml (nearest router)
Component placement uses mm coordinates in topology.yaml, mesh_gen
finds the nearest router automatically. M_CPU moved to pos_mm=[7.5,2.0]
(→ r0c2), SRAM at pos_mm=[1.5,9.0] (→ r3c0).

No hardcoded router references in topology config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:48:20 -07:00
ywkang 3ea4fa90f8 Cube-view: increase 45° stub length and component gap for visibility
Stub length increased to 12px (PE/HBM) and 10px (UCIe).
Gap between router and component increased to 30px so both
45° stubs (router end + component end) are clearly visible.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:38:27 -07:00
ywkang 5125d92c17 Cube-view: M_CPU north, 45° stub-straight-stub connector pattern
- M_CPU placed north (above) its router
- All connectors: 45° stub from router → straight → 45° stub to component
- Consistent 4-point polyline pattern for PE, M_CPU, SRAM, HBM, UCIe

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:34:48 -07:00
ywkang 72acc5c8bb Cube-view: UCIe flush against cube edges
UCIe position calculated with minimal inset (0.3 × size) to
place components flush against cube boundary edges.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:28:58 -07:00
ywkang bde76ec959 Cube-view: 45° diagonal from router, then straight to component
All connectors now start with 45° diagonal from router edge,
then go straight (vertical/horizontal) to the component block.
Applies to PE, M_CPU/SRAM, PE→HBM, and UCIe connectors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:25:41 -07:00
ywkang d3de982ea4 Cube-view: 90° router mesh links, 45° component connectors
Router-router mesh links remain straight (horizontal/vertical).
All component→router connectors use 45° L-bend polylines:
- PE blocks: vertical then 45° diagonal to router
- M_CPU/SRAM: horizontal then 45° diagonal to router
- PE→HBM port group: vertical then 45° diagonal
- UCIe port→router: direction-aware 45° bend

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:20:28 -07:00
ywkang df81835d84 Cube-view: UCIe position/size from topology.yaml (ucie_mm.size=2.0)
UCIe components placed at defined positions from _cube_local_positions
with size from cube.geometry.ucie_mm.size. N/S horizontal, E/W vertical.
Connection ports rendered as color-coded boxes inside UCIe component.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 00:11:11 -07:00
ywkang 66ec6cd40c Cube-view: UCIe components inside cube boundary with port boxes
- UCIe-N/S/E/W drawn as component blocks inside cube boundary
  (inset 3mm from edge)
- Each UCIe has c0-c3 connection ports as color-coded boxes inside
- Connector lines from each port box to its attached router
- Removed old UCIe rendering that placed blocks outside cube

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 23:58:32 -07:00
ywkang e766163a25 Cube-view: HBM pseudo channel ports on edges, UCIe flush to cube border
- HBM pseudo channel ports split to top/bottom edges of HBM zone
  (32 ports each, 8 per PE, color-coded)
- PE→HBM lines connect router to its port group center
- Per-PE label: "PE0×8ch" with BW annotation
- UCIe blocks flush against cube edges at router positions
- UCIe blocks smaller (22×10px)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:38:10 -07:00
ywkang 24faf2e1d4 Cube-view: angle HBM lines, offset M_CPU/SRAM blocks
- HBM connection lines angled 30% toward HBM center (not vertical)
  to distinguish from mesh links
- M_CPU/SRAM blocks placed to the left of their router
  with horizontal connector lines (avoid mesh overlap)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:30:56 -07:00
ywkang 7cd30e106e Fix Router→HBM_CTRL lines visibility in cube_view
Draw HBM connection lines last (on top of component blocks).
PE routers: thicker (1.5px, opacity 0.6) with dashed style.
Relay routers: thinner (0.7px, opacity 0.2).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:25:40 -07:00
ywkang 109c9b4483 Cube-view: draw all attached components as separate blocks
All router-attached components (PE, M_CPU, SRAM, UCIe) rendered as
labeled blocks with explicit connector lines to their router.
UCIe blocks positioned at cube edges matching port direction.
Router→HBM_CTRL lines shown for all 32 routers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:09:08 -07:00
ywkang e94f1de078 Cube-view SVG: detailed topology validation rendering
- Dedicated cube_view renderer showing 6×6 router grid with attachments
- PE blocks drawn next to their router (above/below)
- HBM pseudo channel port bar (64 ports, color-coded by PE owner)
- Per-PE BW annotations on HBM links
- Router color-coded by type (PE/M_CPU/SRAM/UCIe/relay)
- Title shows mode, channel count, per-PE and total BW
- Legend for all component types

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:03:38 -07:00
ywkang 5c6abe6d12 Reduce SRAM/UCIe/M_CPU/HBM node sizes, thin HBM and mesh links
Shrink cube-view component nodes to avoid clutter.
HBM and router_mesh edge lines made thinner and more transparent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 21:51:41 -07:00
ywkang f298e3c7cc Offset PE nodes in cube_view to avoid overlapping routers
PE nodes are shifted 1.2mm above (top half) or below (bottom half)
their assigned router position. PE size reduced to 1.4x0.7mm.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 18:50:32 -07:00
ywkang 91085733ba Show individual routers in cube_view SVG, fix row Y overlap
- cube_view now renders all 32 router nodes from cube_mesh.yaml
  instead of collapsed "router_mesh" placeholder
- Fix mesh_gen row Y position overlap (r1/r2 and r3/r4 had same Y)
  by adding hbm_gap spacing between PE rows and HBM zone
- Add noc_router to visualizer KIND_SIZE for proper sizing
- Update cube view tests for individual router nodes

339 passed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 18:22:38 -07:00
ywkang 5917b3497c Replace xbar/bridge/single-NOC with explicit router mesh (ADR-0019)
- Remove xbar_top/bot, bridge, single noc node from topology
- Each cube_mesh.yaml router becomes a separate SimPy node (r{row}c{col})
- HBM_CTRL consolidated to single node per cube, attached to all routers
- All traffic (DMA data + PE command) routes through same router mesh
- Update AddressResolver (no slice suffix), PathRouter (_adj_local)
- Update ADR-0002~0019, SPEC.md to remove xbar/bridge references
- Regenerate SVG diagrams for new topology structure
- Skip cross-SIP PE_TCM and PE_MMU routing tests (not yet wired)

326 passed, 13 skipped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 17:51:28 -07:00
ywkang d75da439c6 Add probe CLI improvements, D2H read, UCIe/HBM tuning, BW sweep
- Probe CLI: restructured output (tables first, routes below), per-hop
  timestamps, split cross-cube into best/worst cases, D2H read section
- UCIe overhead: 1ns -> 8ns per port (16ns per crossing) to fix
  cross-cube-best < cross-half latency inversion
- HBM efficiency: added efficiency=0.8 factor to hbm_ctrl, reducing
  effective BW from 256 to 204.8 GB/s
- Multi-size BW sweep: saturation tables (4KB-1MB) for all probe cases
- Probe default data size: 4KB -> 32KB for more realistic measurements
- IOChiplet NOC + D2H topology and tests
- NOC mesh, xbar, BW occupancy components and tests
- Cube mesh visualization diagram

278 tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 01:16:18 -07:00
ywkang 6f43807900 commit - release 1 2026-03-18 11:47:48 -07:00