kernbench2

Author	SHA1	Message	Date
mukesh	fc4747668e	paper(§5): add roofline, capacity planning, parallelism selection subsections Three new subsections at the top of §5, motivated by the deployment-sizing questions the analytical visualization tool exposes: - 5.1 Roofline Analysis — B*, L*, short vs long context batch effect (measured: 19× per-token cost reduction at short context, 28% at long) - 5.2 Capacity Planning — 3-axis sizing (weights / KV / SLO), SLO targets, regime rules, deployment templates, per-axis playbook - 5.3 Parallelism Selection — TP × CP × PP × DP × EP comparison, symptom-driven axis selection, add/stop criteria, misconceptions Restructure: - 05-gqa.tex trimmed to section header + intro - Existing 5.4-5.7 content (placement, short/long ctx, composite) moved to 05x-fused-kernel.tex - Summary extracted to 05z-summary.tex, cross-refs updated - main.tex \input order sets the requested subsection sequence - lmodern loaded to satisfy microtype font-expansion Two new figures generated from tests/analytical_visualization/chip_roofline: - roofline_short_context.png (S_kv=8K, batch wins) - roofline_long_context.png (S_kv=1M, batch stalls at KV floor) - Generator: tests/analytical_visualization/_gen_roofline_paper_figs.py 4 tables (regime rules, deployment templates, playbook, parallelism criteria) use table* so they span both columns without overflow. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-30 15:56:41 -07:00
mukesh	bfe3e0e8d1	analytical-viz: parallelism rules — Section 11 TP vs CP shared-budget New Section 11 'TP vs CP: the shared-budget tradeoff' on the Parallelism rules tab. Captures the core insight that fixing TP·CP doesn't equalize the two configurations — larger TP additionally divides model-weight traffic, larger CP only divides the sequence dimension. Five sub-tables: 11a. Same TP·CP = 16 → different weight sharding (TP=8/CP=2 reads 4× fewer weight bytes per PE than TP=2/CP=8 at the same KV division) 11b. What each dimension actually shards (TP·CP·PP·DP × Weights·KV·Attention-math) 11c. When to prefer TP vs CP (6 situations mapped to the winning axis with reasoning) 11d. CP communication profile (prefill vs decode) — bytes/hop, overlap potential, real cost, notes on kv vs qoml variants 11e. Decode cost decomposition mental model: T_decode = T_weight_read(TP) + T_KV_read(TP,CP) + T_TP_comm(TP) + T_CP_merge(CP) Reframes the 'is CP heavier than TP' question as 'which cost dominates T_decode, and which axis attacks it?'. Smoke test asserts the section title appears exactly once. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-30 09:26:22 -07:00
mukesh	55653cbb3e	analytical-viz: new 'Parallelism rules' tab New last tab distilling the parallelism-selection discussion into tables. Framed up-front as heuristics + hard constraints, not a production procedure — with public counterexamples (DeepSeek-V3 PP16 no TP; vLLM PP-on-non-NVLink) that break the naive 'start TP at 8' rule. Ten sections, all table-first: 1. The core principle (chattiest comm on fastest links) 2. Parallelism techniques × comm profile (DP/TP/PP/CP/EP with Shards, Pattern, Frequency, Per-link volume as degree ↑, Hard ceiling) 3. Dominant problem → first axis to try (11 rows: weight fit, activation memory, single-seq KV, many-seq KV, MoE experts, TPOT, aggregate throughput, multi-node, expert layer, global batch cap) 4. When to add / when to stop, per axis (DP/TP/PP/CP/EP/FSDP/SP) 5. Common misconceptions vs reality (9 rows: KV-head hard ceiling, always start TP at 8, EP comm falling, attn-DP shards KV for free, DP > CP always, PP inside domain worthless, memory is only feasibility, FSDP as substrate, pipeline bubble = (P-1)/m) 6. Inference decision procedure (7 ordered steps) 7. Inference cases → recommended shape (A–E: fits on 1 GPU / within NVLink / multi-node / extreme context / many moderate) 8. Training baseline recipe (6 ordered steps) 9. Sanity checks for any candidate config (7 checks) 10. One-paragraph rule + honest caveat block All content is st.markdown + st.dataframe — no new pure functions. Smoke test asserts the tab title + `with tab_prules:` block are both present in app.py. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-30 09:11:08 -07:00
mukesh	5ce7a5b30b	analytical-viz: capacity planning — symbol legend + plain-English formulas Section 1 rewritten so a layman can read it without a glossary: 1. Symbol legend table listing every letter that appears in the three-axis formulas (N, b, N·b, HBM_per_PE, S_kv, kv_bpt, users_per_replica, TPOT SLO, step_latency, B_at_SLO, n_users, N_replicas, ⌈ ⌉). Each row: Symbol / Plain English / Value now for the current sidebar model + chip. For example: - N = 'Total number of model parameters (attention weights + FFN weights, across ALL layers)' → 6.98 billion params for Llama 3 8B - kv_bpt = 'KV cache bytes per token per user, summed across all layers (= 2·H_kv·d_head·b·layers)' → 128 KB/token 2. Three-axis table widened from (Formula / Meaning / Grows with) to (In symbols / In plain English / What it says / Value now). Axis A now shows the actual substituted arithmetic: ⌈ 13.96 GB / 6.0 GB ⌉ = ⌈ 2.33 ⌉ = 3 PEs. B and C flag 'see calculator' because they depend on user inputs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-30 08:35:59 -07:00
mukesh	bc5e704572	analytical-viz: Physical Layout — TTFT & TPOT full-model latency section New section at the bottom of the Physical Layout tab that sums the per-stage latencies into end-to-end metrics for the current sharding: - TPOT = one decode-step time × N_layers (per output token) - TTFT = one prefill pass time × N_layers (for a prompt of the input length, computed in prefill mode) - E2E = TTFT + (N_out − 1) × TPOT - Throughput/replica = B / TPOT (tokens/sec) Two number inputs: prompt length (default = sidebar S_kv) and target N_out tokens (default 100). Four KPI cards + a breakdown table with per-layer and all-layer µs/ms for attention/FFN in both modes. Both TTFT and TPOT are computed by cloning the sidebar TopologyConfig with mode swapped ('decode' for TPOT, 'prefill' for TTFT with T_q = prompt length), then reusing all_stages / all_ffn_stages — no new math. Caveat noted in caption: PP=1 assumption (no microbatch overlap / bubble modeling). Sanity: Llama 3 8B on CP=2/TP=8/default SIP → TPOT ≈ 4.2 ms/token, TTFT ≈ 1.1 s for an 8k prompt. Cross-checkable on the Per-stage latency tab. Smoke test in test_stage_shapes asserts the section title appears exactly once and sits inside tab_layout (before tab_memory). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-29 15:51:06 -07:00
mukesh	c39bbb16aa	analytical-viz: capacity planning tab — add public-example B/replica table Section 6 (new): 'Actually-provisioned batch sizes (public examples)'. Ballpark B/replica figures for Gemini 1.5 Pro/Flash 1M, GPT-4.1 1M, Claude 3.5 Sonnet 200k, and enterprise dedicated tiers, with the inference basis for each ('pricing ~2x standard suggests low B', etc). Caption flags that none are officially published; numbers are backed out from public pricing + SLAs. Binding-axis playbook renumbered #6 → #7. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-29 15:44:35 -07:00
mukesh	77abc95d78	analytical-viz: capacity planning tab — add rules of thumb + deployment templates Two more table sections on the Capacity planning tab: Section 4 — Practical rules of thumb (by context regime) - Short (S_kv < L*): pack HBM, B ~ 2·B*, high util, cheap tier - Long (S_kv > L*): fewer users/replica, low B, util drops as 1/(1 + S_kv/L*), pricier - Extreme (S_kv >> L*): dedicated pool, heavy CP, disaggregated prefill, very low util without sparse attention Columns: Regime, Batch strategy, Utilization, Cost/token, Deployment. Section 5 — Sample deployment templates Same base model, three different sharding recipes routed to by the API gateway based on request context length: - Config_small : CP=1, TP=8, PP=1 → 8 GPUs, up to 32k, B=64 - Config_medium: CP=4, TP=8, PP=1 → 32 GPUs, up to 128k, B=32 - Config_large : CP=32, TP=8, PP=1 → 256 GPUs, up to 1M, B=4 Columns: Tier, CP, TP, PP, Total GPUs/replica, Max context, Typical B, Best for. Playbook section renumbered to #6. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-29 15:13:08 -07:00
mukesh	f54822d247	analytical-viz: move capacity planning to its own tab, table-first layout New 'Capacity planning' tab (last one, after 'Chip roofline & B*'). Everything below is a dataframe table where a bullet list could have gone, matching the user's 'mostly try to make tables to describe the information' preference: 1. Three-axis GPU sizing formula (table: A/B/C rows with Formula, Meaning, 'Grows with' columns) 2. Live sizing calculator (KPI cards + 3 number inputs — this is the only interactive block on the tab) 3. SLO standards by workload (table: TTFT / TPOT targets for voice/code/chat/agentic/batch, with 'what binds' column) 4. Hybrid deployment layers (table: Layer 1 elastic pool / Layer 2 length-tier routing / Layer 3 disaggregated prefill/decode with Serves / Key mechanism / Admission rule / Typical config) 5. Binding-axis playbook (table: for each of A/B/C — Symptom, First lever, Second lever, Cost impact) Removed the sizing calculator + 3 expanders from the bottom of the Chip Roofline tab. Roofline tab is back to pure chip-vs-model / formulas territory; planning tab has the workload / SLO / deployment material. Calculator uses the base sidebar chip (_default_machine) rather than the roofline tab's knob-scaled chip, so results are stable across navigation. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-29 15:09:58 -07:00
mukesh	b7c4c680aa	analytical-viz: roofline tab — 'How hyperscalers pick GPU count' section New section at the bottom of the Chip Roofline tab implementing the three-axis GPU count decision: N_GPUs = max(capacity_floor, KV_headroom, throughput_SLO) × N_replicas Interactive inputs: concurrent users, avg context/user, TPOT SLO. Four KPI cards report each axis + the recommended total, with a 'binds:' delta chip showing which axis is the bottleneck. - Axis A (capacity): PEs to hold weights alone - Axis B (KV): PEs to hold weights + all KV of one replica's users - Axis C (throughput): replicas needed at B_at_SLO users each - Total: max(A,B) × N_replicas Contextual caption explains what to do about the binding axis. If SLO is infeasible (even B=1 exceeds SLO), an error block explains the options (loosen SLO, shorten context, or scale up FLOPs/BW knobs). Plus three collapsed expanders explaining the hybrid deployment pattern hyperscalers use: - Layer 1: one elastic pool + PagedAttention + continuous batching - Layer 2: length-tier routing (standard vs long-context) - Layer 3: disaggregated prefill/decode (DistServe, Splitwise) Pure additions in chip_roofline.py: - max_batch_within_slo(machine, model, s_kv, slo_s) — analytical inverse of step_latency to find the largest per-replica B under SLO - size_deployment(machine, model, n_users, avg_ctx, slo_s) — returns GpuSizingResult with all three axes + binding info 6 new tests cover: capacity scales with model size, KV grows with users/context, binding axis flips with workload shape, tight SLO shrinks max batch, weight-time-exceeds-SLO returns 0, total == per-replica × replicas. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-29 15:07:20 -07:00
mukesh	6410c0843c	analytical-viz: roofline tab — fix NameError from knob-rename refactor Previous commit renamed slider vars (_rf_flops_mult -> _flops_mult_now, etc.) and split _ai/_b_star into base vs scaled, but the AI-sensitivity plot's vertical markers + caption still referenced the old names, so loading the tab crashed with: NameError: name '_rf_flops_mult' is not defined Fixed all four call sites: - _axAI1: 'Base AI' reference line now uses _ai_base (was _ai, which is now the scaled value). - _axAI2: 'Base B*' reference line uses _b_star_base; vertical markers use _flops_mult_now / _bw_mult_now. - Caption 'Your knob position' uses the new names. Grep confirms no _rf_flops_mult / _rf_bw_mult / _ai_scaled / _b_star_scaled references remain. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-29 14:51:01 -07:00
mukesh	97e765b58f	analytical-viz: roofline knobs — absolute TFLOPS / GB/s, drive whole tab Two changes: 1. Knob units and ranges - FLOPs slider: absolute TFLOPS, 1..32 (step 0.5). Was 0.25..8× multiplier. - HBM BW slider: absolute GB/s, 128..1024 (step 16). Was 0.25..8× multiplier. Defaults still initialize to the sidebar Hardware panel's base chip. Caption also shows the implied × multipliers of the base chip. 2. Knobs now drive every roofline number on the tab. Previously FLOPs/BW only affected the AI-sensitivity plot. Now the scaled machine feeds: - AI / B* / L* KPI cards - Regime KPI - Good-B / Good-L cards + utilization - Headline plots (step latency + cost/token) - Regime formulas table (t_com, t_mem_short, t_mem_long values) - Formula table (C, W, knee substitution) Left unchanged (intentionally): - PE memory budget stacks — depend on HBM capacity, not BW/FLOPs. - AI-sensitivity plot itself — sweeps around the SIDEBAR base chip so the curves are comparable across knob positions. Header caption now shows both effective chip (from knobs) and base chip (from sidebar) side by side. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-29 14:47:30 -07:00
mukesh	bda76cbd66	analytical-viz: roofline tab — drop formula from step-latency header Revert the multi-line formula annotation on the 'Latency per decode step' header; make it a single-line markdown title matching the 'Cost per token' header on the right. Both headline plots now render at the same height so the plot area lines up. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-29 14:14:24 -07:00
mukesh	a1c3c28f62	analytical-viz: roofline tab — PE memory budget, AI sensitivity, FLOPs/BW knobs Three additions to the Chip Roofline tab: 1. PE memory budget section (two side-by-side stacked-area plots) - Left: per-PE memory vs S_kv (128..1M, log). Stacks: weights, KV, transient. HBM budget line + current-S_kv marker. - Right: per-PE memory vs B (1..256). Same stacks, current-B marker. KV grows linearly with either axis; weights are invariant of B and S_kv (fixed by CP·TP·PP sharding). Uses actual sharded per-PE quantities from memory_layout so the numbers match what the deployment would really allocate. 2. AI sensitivity section (two side-by-side line plots) - Left: AI vs chip-parameter multiplier. Two curves: scale FLOPs (AI grows linearly), scale HBM BW (AI shrinks inversely). - Right: B* vs multiplier. Same shape as AI (for BF16 where B*=AI). Vertical markers show current FLOPs/BW knob positions. 3. Two new knobs in the top row: FLOPs × multiplier and HBM BW × multiplier (0.25..8, step 0.25). The AI-sensitivity plot markers move with them; the header caption shows scaled AI + B* for the selected point. Also: formula annotation on the headline 'Latency per decode step' plot header showing t_step = N·b/W + 2·N·B/C + B·S_kv·kv_bpt/W. Pure additions in chip_roofline.py (all testable): - MemoryBudgetPoint dataclass - memory_budget_curve_vs_skv, memory_budget_curve_vs_batch - AISensitivityPoint dataclass - ai_sensitivity_curve(machine, model, mults, axis='flops'|'bw') 9 new tests cover linear scaling of KV with S_kv/B, weight invariance across sweeps, over-budget flag flip, free_gb clamp, FLOPs multiplier → AI linear, BW multiplier → AI inverse, B* tracks AI for BF16, bad axis raises. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-29 14:10:02 -07:00
mukesh	9b4aee83da	analytical-viz: roofline tab — trim to 4 plots, cap B at 256 / S_kv at 1M - Slider caps: B_max = 256, S_kv_max = 1M (matches practical decode operating ranges). - Batch axis on every plot capped at 256 to match the slider. - Removed the two redundant plots (old Plot 1 and old Plot 4). Both were re-drawing the same weight/compute/KV decomposition as the new headline 'Cost per token' plot at the top — the only distinguishing bits (extra 'Memory total' curve, B* marker) have been folded into the headline plot already, so keeping them was pure clutter. - Every remaining plot now reacts to the S_kv / B knobs: * Plot A (step latency): S_kv drives curves, B marker drawn. * Plot B (cost/token): same. * Plot 2 (no-knee overlay): S_kv drives the L* ratios shown, B marker drawn. * Plot 3 (knee vs context): S_kv AND B markers drawn on both axes. Net: 4 plots (headline pair + no-knee + knee-vs-context), all live- interactive against both sliders. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-29 13:52:40 -07:00
mukesh	221097ef08	analytical-viz: roofline tab — headline step-latency + cost-per-token plots, B knob Two new plots at the top of the Chip Roofline tab (right after the knob row), side by side: 1. Latency per decode step — one forward pass's total time as B grows. weight_fetch (flat in B), compute (linear), KV (linear), total. The SLO / TTFT view — bigger B = longer step. 2. Cost per token = step latency ÷ B. weight (shrinks 1/B), compute (flat), KV (flat), total. The efficiency / cost view — bigger B (up to ~2·B*) = cheaper per token. Same underlying decomposition, two divisors. Puts the classic throughput ↔ latency tradeoff in one glance. Also added a B (batch size) slider next to the existing S_kv slider. Both plots draw a vertical marker at the current B; the regime KPI ("memory-bound / compute-bound / kv-bound") now uses the slider B, so you can sweep B live and watch the label flip when you cross B*. step_latency_curve + StepLatencyPoint added to chip_roofline; 5 new tests cover: weight_s batch-invariant, compute_s/kv_s linear in B, step_total == per_token_total × B (definitional), and step == per_token at B=1. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-29 13:40:37 -07:00
mukesh	c99a238826	analytical-viz: roofline tab — S_kv knob, regime formulas, good B/L, decomp plot, split table Bundle of roofline-tab enhancements requested in-thread: 1. S_kv slider at top of tab — override the sidebar's S_kv locally so all four plots update as you drag it. Handy for exploring how the KV wall arrives without disturbing the rest of the app config. 2. Good B / Good L KPI cards — 3 metrics under the main KPI row: - Good B = 2·B* (Pope's rule of thumb) - Good L ceiling = L* (compute-friendly context ceiling) - Utilization @ current S_kv = 1 / (1 + S_kv/L*) 3. Plot 4 — latency-decomposition per decode step — five curves on one axis: - Compute (dashed, flat) - Weight fetch (triangles, shrinks 1/B) - KV fetch (squares, flat — batching doesn't help) - Memory total (weights + KV, purple) - Total (compute + memory, black bold) Makes "which term dominates at this B?" visible at a glance. 4. Regime formulas table — one row per cost term (t_com, weight fetch, KV fetch, bottleneck) × two columns (short-context vs long-context regime), plus a 'value now' column using the current S_kv slider. 5. How to pick B and S_kv — one-paragraph guidance with the numeric recommendations plugged in. 6. Split formula table — was cramming symbolic + substituted form into one cell with newlines; now has five proper columns: Symbol / Formula / With numbers / Meaning / Value. New pure functions in chip_roofline.py: t_mem_short, t_mem_long, t_com, good_batch, good_context, utilization_at. All roofline math stays in the pure module; app.py just calls them and formats. 10 new tests bring the roofline test count to 27 (all 67 tests still green): t_mem_long(L*) == t_com, doubling B halves t_mem_short, good_batch scales with sparsity, utilization at 0/L*/2L* returns 1.0/0.5/0.333, monotonic decrease with context. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-29 13:27:17 -07:00
mukesh	04c7d247f0	analytical-viz: roofline formula table shows substituted numeric form Each row in the 'Formulas & interpretation' table now shows both the symbolic form AND the actual numbers plugged in, on a second line in the Formula column. Row height bumped to fit two lines. Rendering: AI: C / W = 8.00e+12 / 2.56e+11 = 31.25 FLOPs/byte L*: 2 · N_active · W / (C · kv_bpt) = 2 · 6.98e+09 · 2.56e+11 / (8.00e+12 · 131,072) = 3,407 tokens B_knee: B* / (1 - S_kv/L*) = 31 / (1 - 8,192/3,407) = 31 / (1 - 2.404) [<= 0 -> no knee] Makes the derivation traceable at a glance without needing to plug the numbers back into the abstract formula. Purely a display change; no runtime behavior shifts, all 57 tests still pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-29 12:04:47 -07:00
mukesh	1ddd1baa7a	analytical-viz: 'Chip roofline & B*' tab — AI, B*, L*, cost curves New tab surfaces the arithmetic-intensity story from LLM-serving practice for the current sidebar chip + model. All derived from MachineParams + ModelConfig; no new configuration. chip_roofline module (pure functions): - arithmetic_intensity = C / W (FLOPs per byte HBM BW) - critical_batch (B*) = C * b / (2 * W) * sparsity - balance_context (L*) = 2 N_active * W / (C * kv_bpt) - knee_batch (B_knee) = B* / (1 - S_kv/L*) (None past L*) - per_token_latency_curve = weight/B + compute + KV, per B - bound_regime = which term dominates now Peak roofline convention (no compute_util factor) so weight_s == compute_s exactly at B*. Comm + TP/CP sharding intentionally excluded — stage_latencies is the full latency model; this is the back-of-envelope chip-vs-model view. Tab shows: - 4 KPI cards: AI, B*, L*, current regime at (B, S_kv) - MoE hint when preset flags MoE ('300 x sparsity' rule) - Plot 1: cost vs B at current S_kv (weight, compute floor, KV floor, total) with B* marker line - Plot 2: cost curves at S_kv/L* = 0.25, 0.5, 1, 2, 5 — shows the no-knee regime past L* - Plot 3: B_knee vs S_kv — knee slides right, diverges at L* - Formulas + interpretation table test_chip_roofline covers B* against H100 reference (295), sparsity scaling, L* scaling with HBM BW, knee divergence at L*, per-token curve monotonicity + asymptote to compute+KV floor, regime classification. Smoke test guards the tab is wired in app.py. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-29 11:59:19 -07:00
mukesh	5e92b89821	analytical-viz: PE view keeps CP rank locator visible at TP>=8 The Per-PE layout diagram used to color-code + label CP ranks only when cp_placement=='pe' (CP packed intra-cube). At TP>=8 the TP dim fills the cube, cp_placement gets pushed to 'cube', the intra-cube CP branch is skipped, and the CP rank tag silently disappeared from cubes, per-PE headers, and the figure title. Users at TP>=8 had no way to tell which of the CP groups the diagram was drawing. Fix: when cp_placement=='cube' and CP>1, add - cube header suffix: '[CP rank 0 of {CP}]' - per-PE header suffix: ' | CP=0' - figure title: 'CP={CP} (showing 1 of CP groups; others identical)' Behavior at cp_placement=='pe' is unchanged — the existing per-CP-rank palette + 'CP={rank}' label still fire. CP=1 adds no locators at all. test_pe_weight_layout covers four cases: TP=8/CP=4 (cube placement, locator appears), TP=2/CP=4 forced to pe placement (regression: per-rank labels still there, cube tag not added), CP=1 (no locators), TP=16/CP=2 (both spilled cubes carry the CP-rank-0 tag). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-29 10:19:39 -07:00
mukesh	3360be2488	analytical-viz: consolidate shape tables under Physical Layout tab Move the per-stage shape tables (attention + FFN) off the Per-stage latency tab and combine them with the weight + KV shape tables into a single 'All tensor shapes (per PE)' expander at the bottom of the Physical Layout tab. Five subsections in one place: - Attention weights (global + per-PE shape, per layer) - FFN weights (same schema) - KV cache (same schema) - Per-stage attention shapes (input / weight / output) - Per-stage FFN shapes (same) Rationale: the shape catalog belongs next to the physical-layout view that shows how the tensors are placed on hardware. The Per-stage latency tab now focuses purely on time-per-stage. Weight/KV tables stay on the Memory Breakdown tab too because the summary metrics underneath still consume the same rows. Guard test verifies the section string appears exactly once, is positioned inside tab_layout (before tab_memory / tab_stages), and that neither attn_stage_shape_rows nor ffn_stage_shape_rows is called inside the tab_stages block anymore. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-29 10:14:08 -07:00
mukesh	adc14c84af	analytical-viz: per-stage tensor shape tables (attention + FFN) New stage_shapes module reports per-PE input / weight / output tensor shapes for every attention (S1..S10, C1..C3) and FFN (F1..F5, CF1) stage. Streamlit's Stages tab picks these up below the existing latency table + bar chart as two separate dataframes ('Attention stages' and 'FFN stages'), so anyone who wants to see the exact sharded tensor dims per stage can, without cluttering the latency view. Rows mirror the conditional structure of all_stages / all_ffn_stages: S8 merge only when CP>1; C1 CP ring only in prefill with CP>1; C3 Score AR only when kv_shard_mode='split' and TP>H_kv; C2 TP AR only when TP>1; CF1 FFN AR only when the FFN divisor > 1. Batch B flows into every activation dim; weights stay B-invariant. test_stage_shapes covers column presence, conditional row insertion (decode omits C1; CP=1 omits S8; prefill+CP>1 inserts C1), batch scaling of the leading dim, and that S5 output references both T_q and S_local for the current cfg. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-29 10:05:02 -07:00
mukesh	9c5062bdfb	analytical-viz: memory-min buttons reset stale placement/mode session keys auto_suggest only explores (cp, tp, pp, cp_placement); every other TopologyConfig field is left at its dataclass default when the score is computed. The sidebar's memory-min apply used to update only cp/tp/pp/dp/cp_placement, so any stale session_state value from a previous user interaction (most damaging: tp_placement="cube") would carry over into the rerun and produce a topology the memory-min search never scored — visibly, cubes_used could balloon (e.g. Llama 3 70B Attn scope: auto_suggest picks 1 cube; stale tp_placement="cube" made the applied config span 8 cubes). Fix: after picking a Suggestion, reset tp_placement, kv_mode, cp_ring_variant, ep back to their TopologyConfig defaults, and drop the ffn_scope_label key (dynamic string; falls back to the TP+CP default index on re-render). Now the sliders + placement + mode state after apply exactly reproduce the topology auto_suggest scored. test_auto_suggest_cubes_match_default_topology asserts, across four model/skv combinations and all three scopes, that a TopologyConfig built from (cp, tp, pp, cp_placement) plus dataclass defaults has the same cubes_used the Suggestion reports. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 23:45:38 -07:00
mukesh	4ded67a443	analytical-viz: sidebar memory-min buttons are scope-aware (Attn / FFN / Attn+FFN) Replace the three memory-optimal Pareto-search buttons in the Parallelism sidebar with three scope-aware memory-min autosuggest buttons. Each button sizes for a different per-PE memory footprint: - Attn -> attention weights + KV cache (no FFN weights) - FFN/MoE -> FFN weights only (no attention, no KV) - Attn+FFN -> everything (traditional autosuggest) memory_layout: per_pe_weight_bytes and compute_memory take include_attention / include_ffn flags; KV cache is zeroed when attention scope is excluded. autosuggest: _score_candidate and auto_suggest forward the flags into compute_memory, so the smallest-fit search now respects the chosen scope. app.py: single "Apply memory-min" caption + 3 column buttons. Each button runs auto_suggest with its scope filter, snaps the sliders to the picked (CP, TP, PP), and sets the Physical Layout scope filter to match. Removes the standalone Apply memory-min button (was duplicating the Attn+FFN case) and the Pareto-search import. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 23:34:05 -07:00
mukesh	47c4f40f4d	analytical-viz: batch B in every formula string + rename 'latency-optimal' → 'memory-optimal' Two related fixes. 1) Formula strings show B alongside every other symbol. Previously the numeric FLOPs/mem_bytes/comm_bytes were scaled by B, but the printed formula strings still read like B=1 amounts. That made the per-stage table confusing at high B: numbers moved but formulas didn't. Updated every attention + FFN stage's `formula` / `flops_formula` / `mem_formula` / `comm_formula` to include B explicitly. Weight-bytes lines now say "(shared across batch)" / "(weight, B-invariant)" so it's obvious what does and doesn't scale. Stages touched: S1..S10 (attention), C1/C2/C3 (attn comm), F1..F5 (FFN), CF1 (FFN AR). 2) Renamed the button/spinner/help copy from "latency-optimal" to "memory-optimal" to match the smallest-fit pick semantics I put in place earlier. The buttons already picked smallest-fit; the labels just still said "latency-optimal" from the previous iteration. Left the Auto Hardware sensitivity chart's "latency-optimal baseline" label alone — that panel is a HW co-design view where "how fast can each HW go?" is the intended question. Verified: 24 pytest tests still pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 23:20:32 -07:00
mukesh	50aab8d591	analytical-viz: complete batch (B) scaling in remaining stages + comm Finishes the batch scaffold. All stages that produce per-request work now scale flops / activation-memory / comm-bytes by B; weight bytes stay fixed (shared across the batch). Attention block, previously partial: - S6 stage_softmax: elems + bytes × B - S8 stage_merge: flops × B; O/m/l AR bytes M × B - S9 stage_normalize: flops × B - S10 stage_wo: flops × B - C1 comm_cp_ring: decode O/m/l AR M × B; prefill K/V ring + Q+O/m/l variants both × B - C2 comm_tp_allreduce (W_O output): bytes × B - C3 comm_kv_split_allreduce (head-split scores): bytes_per_hop × B FFN block, previously untouched: - F1 stage_ffn_rmsnorm: activation × B (weight fixed) - F2 stage_ffn_gate (via _ffn_gemm): flops × B - F3 stage_ffn_up: same - F4 stage_ffn_swiglu: elems + flops × B - F5 stage_ffn_down: flops × B - CF1 comm_ffn_allreduce (batched FFN output): bytes × B Verified with a smoke check on Qwen 3 8B / 128K decode: B=1 per-layer visible latency: 363 us B=8 716 us (sub-linear — many stages stay weight-bound) B=64 4023 us (approaching linear scaling as batch dominates) All 24 pytest tests still pass at default B=1 (backward compat). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 23:10:39 -07:00
mukesh	fc1dcdde24	analytical-viz: tensor sharding — show batch B on KV cache Four additions to the tensor sharding view when batch B > 1: 1. Shape label appends "B={B}": e.g. (128, S_kv=131072, B=8). 2. Note under the KV cache: "batch B={B} => {B}x KV bytes per PE". 3. Visual "stack" — up to 5 dashed offset rectangles drawn behind the KV cache to hint at the batch stacking dimension. Capped at 5 so a big B doesn't overwhelm the diagram. 4. Title also gets B={B} between CP and FFN scope. Attention/FFN weight tensors are NOT stacked — weights are shared across the batch (correct: only activations + KV scale with B). At B=1, all four additions are no-ops so the diagram looks unchanged from before. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 22:56:48 -07:00
mukesh	fd45bd2408	analytical-viz: add sidebar Batch B selectbox + wire through all suggesters New sidebar widget in the Workload expander: Batch B (concurrent requests) — selectbox 1/2/4/…/256, default 1. Wired end-to-end: - app.py sidebar → forwards b_batch when constructing the main topo - auto_suggest(..., b=1) — passes b to _score_candidate, which builds TopologyConfig with b=b - run_auto_explore(..., b=1) — sets topo.b = b for every enumerated candidate before scoring - joint_explore(..., b=1) — forwards b to _best_parallelism_for_hw and _best_parallelism_two_stage; both set topo.b = b before scoring All button handlers (sidebar Apply-scope, Physical Layout scope, Auto Suggest tab sweep, Auto Hardware tab sweep) now pass b_batch. Combined with the earlier partial-scaffold commits (memory_layout scales KV by B; stage_rmsnorm / stage_wq / stage_wkv / stage_kv_append / _per_hop_qkT_pv scale flops + activation memory by B), changing B in the sidebar now affects reported latency and per-PE memory footprint in the visible parts of the pipeline. The remaining FFN + comm-AR stages still ignore B (they'll be next); their contribution is small for the memory-bound decode case that matters most, but latency for FFN-heavy configs at high B will be slightly under-reported until those are scaled too. Verified: 24 pytest tests pass at default B=1 (backward compat). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 22:51:56 -07:00
mukesh	674e194dc9	analytical-viz: hot-reload guard — include model_config + model_presets, ordered The reload guard didn't include model_config.py. When the batch (B) field was added to TopologyConfig, Streamlit kept the stale pre-B dataclass in sys.modules and every downstream module hit "AttributeError: 'TopologyConfig' object has no attribute 'b'". Fix: add model_config and model_presets to the reload list, and order model_config FIRST so downstream modules that reload after it pick up the new dataclass definition. Also reordered the rest so upstream dependencies (autosuggest, memory_layout, stage_latencies) reload before their consumers (auto_explore, auto_hardware). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 22:43:47 -07:00
mukesh	770aed6291	analytical-viz: Auto Suggest tab — add Top-10 by memory footprint table Adds a second table beneath the Pareto configurations table, sorted by (hbm_utilization ↑, pes_used ↑, latency ↑) — top 10 smallest-memory Pareto configs first. Columns highlight the memory story: HBM % \| weights (GB) \| KV (GB) \| PEs \| SIPs \| lat (ms) \| CP TP PP DP \| kv \| ffn \| tp_place \| cp_place Useful when memory pressure is the real binding constraint of a deployment — e.g., picking a config for a per-PE HBM-limited SIP, or spotting configs that trade small latency headroom for large memory savings. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 22:24:38 -07:00
mukesh	1bce080d79	analytical-viz: buttons pick smallest-fit Pareto point, not fastest Sidebar Attention/FFN/Attn+FFN buttons and Physical Layout tab buttons now select the SMALLEST-FIT config from the scope's Pareto set — key changed from (latency ↑, pes_used ↑, hbm_utilization ↑) to (pes_used ↑, hbm_utilization ↑, latency ↑). Answers "smallest deployment that's still Pareto-optimal for this scope" instead of "fastest deployment achievable". Left unchanged: - "Best latency" metric displays in Auto Suggest / Auto Hardware tabs still show the fastest number (informational; the metric labels say "Best latency"). - auto_hardware._best_parallelism_for_hw stays latency-min: it's inside HW co-design, where "how fast can each HW candidate go" is the primary question. Verified: 24 pytest tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 22:23:34 -07:00
mukesh	5d7ef48b14	analytical-viz: partial batch (B) scaffold — TopologyConfig field + KV / early stages WIP toward batch-size support. This first commit is behavior-preserving at the default B=1 (every new multiplication is by max(1, cfg.topo.b) = 1 today) so all existing tests pass. Follow-up commits will: - scale the remaining stages (S6/S7/S8/S9/S10 + C1/C2/C3 + FFN + FFN AR) - add a batch selectbox to the sidebar - forward b through auto_suggest / auto_explore / auto_hardware Changes so far: - TopologyConfig: new b: int = 1 field (batch size). - memory_layout.per_pe_kv_cache_bytes: * B (each concurrent request keeps its own KV cache slice). - stage_rmsnorm / stage_wq / stage_wkv / stage_kv_append / _per_hop_qkT_pv: FLOPs and activation memory scaled by B; weight bytes stay fixed (weights shared across the batch). Verified: 24 pytest tests still pass at default B=1. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 22:21:37 -07:00
mukesh	2a0bc26db0	analytical-viz: compound tiebreaker — fewer PEs / lower HBM% when latency ties Every place that picks a single "best" config from the Pareto set now uses a compound sort key: (latency_ns ↑, pes_used ↑, hbm_utilization ↑). Primary = latency; when configs tie on latency (common — cp_ring variants, some placement variants often produce identical numbers), prefer smaller-footprint picks. Places updated: - app.py: sidebar Apply Attn/FFN/Attn+FFN button - app.py: Physical Layout tab Attn/FFN/Attn+FFN button - app.py: Auto Suggest tab "Best latency" metric - app.py: Auto Hardware tab "Best latency" metric (uses parallelism.pes_used + parallelism.hbm_utilization since JointScore wraps ConfigScore) - auto_hardware.py: _best_parallelism_for_hw iteration key No behavior change when there's a strict latency winner. When there are ties, the picked config uses fewer PEs and lower HBM utilization. Verified: all 24 pytest tests pass (default include_attention=True and include_ffn=True paths unchanged). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 22:16:02 -07:00
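A minimal sketch of the compound tiebreaker described in the commit above. The `Score` class is a hypothetical stand-in for the repository's `ConfigScore`; only the three field names and their ascending order are taken from the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    """Hypothetical stand-in for ConfigScore (illustration only)."""
    latency_ns: float
    pes_used: int
    hbm_utilization: float


def pick_best(pareto):
    """Pick the lowest-latency config; ties go to fewer PEs, then lower HBM%."""
    return min(pareto, key=lambda s: (s.latency_ns, s.pes_used, s.hbm_utilization))


# Two configs tie on latency; the one using fewer PEs should be picked.
candidates = [
    Score(latency_ns=7.35e6, pes_used=128, hbm_utilization=0.60),
    Score(latency_ns=7.35e6, pes_used=64, hbm_utilization=0.80),
    Score(latency_ns=9.00e6, pes_used=32, hbm_utilization=0.40),
]
best = pick_best(candidates)
```

Python compares tuples element by element, so the later fields only matter when the earlier ones are equal, which matches the "no behavior change when there's a strict latency winner" note.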
mukesh	6f61657ba4	analytical-viz: topology map — color PEs by CP rank when cp_placement=pe The topology map previously colored a whole cube (border + all PE fills) by the single cp_rank assigned to that cube in the cube→(pp,cp) mapping. Under cp_placement=cube this is right (each cube = one CP rank). Under cp_placement=pe, however, multiple CP ranks are PACKED into the same cube's PEs, so the whole-cube coloring makes every PE look identical (defaulting to cp_rank=0's palette entry — bright red). Fix: in the per-PE loop, if cp_packed = (cp_placement=="pe" and cp>1), compute each PE's own cp_rank = pe_id // tp and look up its color via _cp_color(pp_stage, pe_cp_rank, cp_size). Border still uses the cube-level color (cp_rank=0), so the outer bounding box is unchanged, but the interior PEs now show all four (or however many) CP-rank colors visibly. For cp_placement=cube: unchanged (single pe_fill per cube). Verified with a headless render at Qwen 3 8B, CP=4, TP=2, cp_placement=pe: 145 patches → 8 distinct facecolors (4 for the CP ranks in the packed cube + inactive/border tints), where before it was fewer distinct colors. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 22:07:50 -07:00
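A small sketch of the per-PE rank arithmetic this commit relies on under cp_placement=pe. The helper name `pe_ranks` is hypothetical; the formulas `cp_rank = pe_id // tp` and `tp_rank = pe_id % tp` are the ones quoted in this commit and in ed28eea156.

```python
def pe_ranks(pe_id: int, tp: int) -> tuple[int, int]:
    """Return (cp_rank, tp_rank) for a PE inside a packed cube."""
    return pe_id // tp, pe_id % tp


# CP=4, TP=2 packed into one 8-PE cube, as in the verification run.
layout = [pe_ranks(pe, tp=2) for pe in range(8)]
```

Each consecutive group of `tp` PEs shares one CP rank, which is why the packed cube shows four distinct CP colors.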
mukesh	da1909fdf2	analytical-viz: extend hot-reload guard to all viz sub-modules Previously only reloaded auto_explore / auto_hardware / autosuggest / stage_latencies / memory_layout. When we edit pe_weight_layout, tensor_sharding, topology_map, pipeline_diagram, or optimization_report, Streamlit was still holding the pre-edit versions until a full restart. Add all five to the reload list so any visualization-module edit lands on the next Streamlit rerun without needing Ctrl+C. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 22:03:42 -07:00
mukesh	ed28eea156	analytical-viz: color-code CP groups in PE layout when cp_placement=pe When cp_placement=pe packs multiple CP ranks intra-cube, the PE-level layout now: 1. Colors each PE's background by its CP rank (8-color palette wraps for cp > 8). cp_placement=cube keeps the historical light-blue styling (only one CP rank per cube tile). 2. Adds "TP=N \| CP=M" to the per-PE header so users can read the (cp_rank, tp_rank) pair without inferring from position. 3. Fixes a head-assignment bug: _q_heads_for_pe / _kv_heads_for_pe / _per_pe_bytes used to be called with the raw PE-in-group index, which treated every PE as a distinct TP rank. Under cp_placement=pe this gave wrong Q/KV head lists for PEs beyond the first TP group (PE 2..7 with tp=2, cp=4 would ask for head slots 32..127 in a 32-head model). Now called with tp_rank = pe_id % tp, so all CP ranks in the cube share the correct head split. Verified via a headless matplotlib smoke test with CP=4/TP=2 on Qwen 3 8B: 8 PE patches + 1 cube patch → 5 distinct facecolors (cube + 4 CP ranks), no errors. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 21:59:18 -07:00
mukesh	f1cf0257d3	analytical-viz: restore Apply memory-min button An "Apply memory-min" button was removed when the 3 latency-optimal buttons were added. Consequence: after clicking any latency-optimal button, the sidebar sliders drift from the memory-min caption and there was no way to sync them back without restarting the app or manually adjusting each slider. Restore the button, positioned below the caption and above the 3 latency-optimal buttons. Clicking it snaps cp/tp/pp/dp/cp_placement to what the caption shows and clears _pl_active_scope to "full" so the Physical Layout tab stops filtering. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 21:55:29 -07:00
mukesh	afe35ee0b5	Revert "analytical-viz: default intra-cube (PE↔PE) BW to 128 GB/s" This reverts commit `84abf847c4`.	2026-07-28 21:50:17 -07:00
mukesh	84abf847c4	analytical-viz: default intra-cube (PE↔PE) BW to 128 GB/s Was 512 GB/s in the analytical viz MachineParams default. Bringing the default down to 128 GB/s to match what the modelled physical link now represents (parity with inter-cube D2D and topology.yaml). Files touched (all three needed to keep defaults coherent): - model_config.py: MachineParams.bw_intra_gbs = 128.0 (was 512.0) - app.py: sidebar selectbox default index = 0 (128 GB/s) instead of 2 (512) - auto_hardware.py: _HW_KNOB_DEFAULTS + BALANCED + COARSE bw_intra_gbs baselines shifted so cost_score at defaults remains 6.0 and the sensitivity sweep starts from the new baseline. Verified: 24 pytest tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 21:44:21 -07:00
mukesh	a7f39ade7e	analytical-viz: autosuggest prefers fewer cubes, tries cp_placement=pe Two changes to the memory-only auto_suggest so a "fits" that packs into one cube is preferred over a "fits" that spreads across multiple cubes. Suggestion dataclass gains cubes_used and cp_placement fields. _score_candidate: for each (CP,TP,PP) triple, try cp_placement="cube" (historical default: CP spans cubes) and, when CP·TP fits in one cube's PE count, also cp_placement="pe" (pack CP into intra-cube PEs). Keep the placement with fewer cubes; break ties toward "cube". auto_suggest: switch sort key from (pes_used ↑, pp ↑, tp ↑, cp ↑) to (cubes_used ↑, pes_used ↑, pp ↑, tp ↑, cp ↑). Fewer cubes wins first because a cube is the physical die-level unit; PE count is the tiebreaker. Sidebar caption now also displays cubes_used + cp_placement so the user can see the packed layout at a glance. Preset-change auto-reset also applies the picked cp_placement. Observed effect (verified): Qwen 3 8B / 128K decode: CP=4 TP=2, "pe" → 1 cube, 8 PEs (was 4 cubes with default "cube") Llama 3.1 70B / 128K: CP=4 TP=16, "cube" → 8 cubes (unchanged; TP=16 > 8 PEs/cube, can't pack) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 16:04:17 -07:00
mukesh	c2b42999d8	analytical-viz: reorder tabs — Physical layout back to first New tab order: Physical layout / Auto Suggest Parallelism / Memory breakdown / Per-stage latency / Save & compare / Auto Hardware Physical layout is what a user typically wants to see first when loading the app, so it takes the leftmost slot again. Auto Suggest Parallelism moves to second — still prominent, still usable as a first step, but doesn't push the layout view behind it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 14:26:07 -07:00
mukesh	9afeb6bc16	analytical-viz: 3 sidebar apply-buttons (Attn / FFN / Attn+FFN) Replaces the single memory-only "Apply auto-suggest" button in the sidebar Parallelism expander with three latency-optimal buttons: "Attention", "FFN/MoE", "Attn+FFN". Each clicked button runs run_auto_explore for its scope, picks the latency-minimum Pareto config, snaps the parallelism knobs to the sidebar's selectbox option sets (CP/TP/PP/DP), and loads the other knobs directly (tp_placement, cp_placement, cp_ring_variant, kv_mode, ffn_scope_label). Also sets _pl_active_scope so the Physical Layout tab's stage-table filter follows the scope automatically. The caption above the buttons still shows the memory-only autosuggest values as a reference — separately labeled "(memory-min)" to avoid confusion with the latency-optimal buttons below. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 14:24:06 -07:00
mukesh	bccc90db82	analytical-viz: Physical Layout scope filter for stage tables Clicking one of the three Physical Layout tab buttons now persists the chosen scope in session_state["_pl_active_scope"] and filters the per-stage latency tables accordingly: - attn scope → show only "Attention" stage table - ffn scope → show only "FFN" stage table - full scope → show both (default; also matches a fresh sidebar-driven config with no button click yet) Added a "Layout scope: {label}" header so the user can tell at a glance which scope's config is loaded. The pipeline diagram + topology map + weight/tensor sharding + PE layout continue to show the full model layout — those diagrams give context that stays useful even when the user is focusing on attn or FFN. Deeper filtering (e.g., attention-only pipeline stripe) can come later if needed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 14:21:27 -07:00
mukesh	1a6d52c58a	analytical-viz: force-reload our modules on Streamlit rerun Streamlit's hot-reload re-executes app.py on every interaction, but sub-modules imported from app.py stay cached in sys.modules across reruns. When we edit auto_explore.py or auto_hardware.py while Streamlit is alive, the app keeps using the pre-edit version until a full Ctrl+C + restart — leading to confusing "unexpected keyword argument" errors after a signature change. Force importlib.reload() on our own modules at the top of app.py so future signature changes land without a full restart. Only reloads if the module is already in sys.modules (first run just imports normally). Applied to: - tests.analytical_visualization.auto_explore - tests.analytical_visualization.auto_hardware - tests.analytical_visualization.autosuggest - tests.analytical_visualization.stage_latencies - tests.analytical_visualization.memory_layout Third-party modules (streamlit, matplotlib, pandas, numpy) NOT reloaded — unnecessary and slow. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 14:17:13 -07:00
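A minimal sketch of the reload guard described in the commit above, assuming a helper named `reload_if_loaded` (hypothetical). It uses a stdlib module as a stand-in for the `tests.analytical_visualization.*` modules so it runs anywhere.

```python
import importlib
import sys


def reload_if_loaded(module_names):
    """Reload each named module only if it is already in sys.modules.

    Modules not yet imported are skipped (the first run imports them
    normally). Returns the names that were actually reloaded, in order.
    """
    reloaded = []
    for name in module_names:
        module = sys.modules.get(name)
        if module is not None:
            importlib.reload(module)
            reloaded.append(name)
    return reloaded


import colorsys  # stand-in for one of the app's own sub-modules

result = reload_if_loaded(["colorsys", "no_such_module_xyz"])
```

The list is processed in order, so placing upstream modules first lets their consumers pick up fresh definitions when they reload afterwards.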
mukesh	09721040ca	analytical-viz: 3 sweep buttons at top of Physical Layout tab Shortcut: click any of {Attention, FFN/MoE, Attn+FFN/MoE} → runs run_auto_explore for that scope, picks the latency-minimum Pareto config, loads it into the sidebar's session_state (cp, tp, pp, dp, tp_placement, cp_placement, cp_ring_variant, kv_mode, ffn_scope_label), then st.rerun(). Because the Physical Layout tab reads model + machine + cfg from the sidebar, the sidebar update automatically redraws the pipeline diagram, per-stage table, and everything below. No preview / parallel-display state — WYSIWYG. ffn_scope_label reconstruction handles the dynamic "(div=…)" suffix the sidebar shows (same pattern as auto_explore + auto_hardware tabs' "Load into sidebar" widgets). Verified: app.py parses; 24 pytest tests still pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 14:11:30 -07:00
mukesh	bf5b659a3d	analytical-viz: add FFN-only scope + 3rd sweep button Both auto tabs now offer three sweep scopes via three buttons instead of two: - Run sweep — Attention → include_attention=T, include_ffn=F - Run sweep — FFN/MoE → include_attention=F, include_ffn=T - Run sweep — Attn + FFN/MoE → include_attention=T, include_ffn=T Each button caches its result under its own session_state key; the most recently clicked button drives the display. All three caches persist so users can flip between scopes without re-running. Core changes: auto_explore.py + auto_hardware.py: - New include_attention: bool = True param alongside include_ffn - _sum_visible_latency, _efficiency, score_config, run_auto_explore, compute_parallelism_sensitivity, joint_explore, compute_sensitivity, _best_parallelism_for_hw, _best_parallelism_two_stage all wired. - Attention and FFN are additive with no overlap: measured 7.35 ms (attn) + 5.50 ms (ffn) = 12.85 ms (full) for Llama 70B decode 128K. Bug fix (drive-by): the display block in both tab render functions was incorrectly nested inside the "cached ctx is stale" warning branch, so metrics/scatter/table/sensitivity/load only rendered when the cache was stale. Un-indented and removed the dead trailing else-info block. Verified: - 24 pytest tests pass (added 1 new for FFN-only scope invariants) - Smoke: Llama 70B decode 128K all three scopes produce sensible Pareto: * Attention only: 7.35 ms (14 pareto configs) * FFN / MoE only: 5.50 ms (34 pareto configs) * Attn + FFN/MoE: 12.85 ms (6 pareto configs) Sums add up exactly, confirming no overlap in stage summation. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 14:09:24 -07:00
mukesh	b8e1a3322f	analytical-viz: attention-only vs attn+FFN via two sweep buttons Both auto tabs now accept a scope choice at run time: - "Run sweep — Attention" → include_ffn=False - "Run sweep — Attn + FFN/MoE" → include_ffn=True Each button runs an independent sweep and caches its result under its own session_state key. The most recently clicked button determines the displayed view; both caches persist so users can flip between the two scopes without re-running. Tabs: - Renamed "Auto Explore" → "Auto Suggest Parallelism" (accurately reflects that it only varies parallelism knobs; HW is held at the sidebar values). - "Auto Hardware" tab unchanged. - Still 6 top-level tabs; no additional tabs added. Core changes: auto_explore.py: - New include_ffn: bool = True parameter on _sum_visible_latency, _efficiency, score_config, run_auto_explore, compute_parallelism_ sensitivity. False drops all FFN stages from the summed latency. auto_hardware.py: - New include_ffn: bool = True parameter on joint_explore, compute_sensitivity, _best_parallelism_for_hw, _best_parallelism_ two_stage. Forwards to score_config. Both defaults keep existing tests byte-identical. Verified: - 23 pytest tests pass (added 3 new: attn-only latency lower, attn-only Pareto non-empty, joint HW attn-only faster than full). - Smoke: Llama 70B decode 128K: * Attn+FFN best latency: 12.85 ms (unchanged) * Attention-only best: 7.35 ms (~57% of full) * Both sensitivities top-rank bw_hbm_gbs (physics preserved). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 13:58:29 -07:00
mukesh	91b63eb9f6	analytical-viz: parallelism sensitivity chart in Auto Explore tab Adds compute_parallelism_sensitivity() + ParallelismSensitivityRow to auto_explore.py. For a baseline ConfigScore (picked from the Pareto set), sweeps each parallelism knob (CP, TP, PP, DP, EP) individually while holding others fixed. Reports latency + memory-fit per value. Sweep values start at 1 and step by 2 (multiples of 2 rather than only powers of 2), giving finer granularity for the visual than the enumerator's sparser set. CP goes up to 256 (per real-deployment scale), TP to 64, PP to 32. New UI panel in Auto Explore tab: 5 subplots (log/log), one per knob: - Solid line = latency where the config fits memory - Red X markers = infeasible (out of budget) - Dotted vertical line = baseline value User picks which Pareto row is the "baseline" via a number input; the sensitivity chart re-computes around it. Behaviour caveat noted during verification: for large PP the FFN AR can cross a SIP boundary (uses stage_latencies.py:730 sips_used tier selection), producing a step-up in latency. This is the existing model's choice, honestly reflected in the chart. Whether the FFN AR should span only one PP stage's ranks (thus not cross SIPs) is a separate discussion about stage_latencies.py. Verified: - 11 pytest tests pass (added 2 new for parallelism sensitivity) - Smoke: at Llama 70B decode 128K with baseline CP=8/TP=16/PP=1/DP=1: * CP sweep: 4 fits, 8 = baseline optimum, 16+ overshoots memory * TP sweep: 8/16 fit, 16 = baseline optimum, 32 slower (spans SIPs) * PP sweep: 1 = baseline, 2+ slower due to sips_used tier drop for FFN AR * DP sweep: same shape as PP * EP sweep: monotone decreasing (bigger EP = smaller per-PE FFN) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 13:39:51 -07:00
mukesh	55c20bd1a6	analytical-viz: add Auto Hardware Streamlit tab Consumes auto_hardware.joint_explore(). UI: - Sweep-depth radio: two_stage / balanced (default) / coarse - Run joint sweep button + spinner - 4 metric cards: HW candidates, feasible joint, Pareto count, best latency - Panel 1 (Pareto scatter): latency vs hardware cost proxy, feasible in grey, Pareto colored by PE count, dashed line connects Pareto in cost order — this is the "optimal HW config for the model" plot - Panel 2 (sensitivity bar chart): per-knob relative speedup when doubled from the baseline. Green bars, sorted biggest first; annotated with the baseline → doubled values. Answers "where to invest next?" - Panel 3 (table): sortable Pareto joint configs with parallelism + HW spec columns (PE HBM, HBM BW, TFLOPs, PE↔PE, D2D, C2C) - Load into sidebar: picks a Pareto row and syncs both the HW selectboxes (Per-PE + Interconnect sub-tabs) AND the parallelism sliders. Values only get loaded when the target is one of the selectbox options; otherwise the sidebar keeps its current value. Session-state cache under _hw_result keyed by (model, s_kv, mode, depth); if any of these drift, a warning suggests refreshing. Verified: app.py parses; auto_hardware smoke run on Llama 70B decode 128K in ~20s produces sensible HW co-design signal (HBM BW 33% speedup, everything else <1.5%). Next: verification across Qwen 3 8B, Mixtral 8x7B (Commit 6). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 13:19:13 -07:00
mukesh	6ef09dd6f5	analytical-viz: auto_hardware.py — joint HW×parallelism explore New module extends auto_explore into the hardware co-design space. For a fixed model + workload, sweeps hardware knobs (pe_hbm_gb, bw_hbm_gbs, peak_tflops_f16, bw_intra_gbs, bw_inter_gbs, bw_intersip_gbs) and, for each hardware candidate, searches parallelism for the latency-minimum that fits memory. Returns: - all_scores: every (hw, parallelism) pair that fits, sorted by latency - pareto_scores: 2D Pareto frontier on (latency ↓, cost_score ↓) - sensitivity: per-knob rel_speedup when doubled from the best-fast HW baseline. Ranks which HW knob gives the biggest speedup — a co-design signal. Three sweep depths trade coverage for time: - two_stage: 1 HW candidate (defaults) × autosuggest's memory-min parallelism. Fast (~1s), useful for the sensitivity ranking alone. - balanced: 64 HW × ~2k reduced-parallelism configs = ~130k joint evals, ~10-20s. Default UI setting. - coarse: 729 HW × ~2k configs = ~1.4M joint evals, ~2-5 min. Reduced parallelism sweep for the inner loop: CP × TP × PP × DP × kv_shard_mode (1,920 configs), other 4 knobs held at latency-friendly defaults (ffn_shard_scope='TP+CP', tp_placement='cube', cp_placement='pe', cp_ring_variant='qoml' for decode, 'kv' for prefill). Full 28,800-config auto_explore per HW would take 6+ minutes — too slow. Cost proxy: sum of (knob / knob_default). 6.0 at defaults. Not dollars — a rough capability score where higher = "more spec'd hardware". Verified: - 9 pytest tests pass: * enumeration counts match expected (1, 64, 729) * default cost_score = 6.0 * Pareto non-dominated + subset of all_scores * every knob is monotone-non-worsening when doubled * for Llama 70B decode, bw_hbm_gbs tops the sensitivity ranking (physically correct: memory-bound workload) - Smoke: Llama 70B decode 128K balanced sweep in ~20s produces 5 Pareto configs; best 7.57 ms with 1024 GB/s HBM BW. Doubling HBM BW gives 33% additional speedup; every other knob < 1.5%. 
Next: Streamlit UI tab consuming this in Commit 5. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 13:16:40 -07:00
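A rough sketch of the cost proxy and the 2D Pareto filter described in the commit above. The function names and the default knob values are placeholders chosen for illustration; only the knob names, the "sum of (knob / knob_default)" formula, and the (latency ↓, cost_score ↓) objective come from the commit message.

```python
# Placeholder defaults (assumed values, not the repository's real ones).
DEFAULTS = {
    "pe_hbm_gb": 16.0,
    "bw_hbm_gbs": 512.0,
    "peak_tflops_f16": 64.0,
    "bw_intra_gbs": 512.0,
    "bw_inter_gbs": 128.0,
    "bw_intersip_gbs": 64.0,
}


def cost_score(knobs, defaults=DEFAULTS):
    """Sum of knob / knob_default: a capability score, not dollars."""
    return sum(knobs[name] / defaults[name] for name in defaults)


def pareto_front(points):
    """Keep (latency, cost) points not dominated on both minimised axes."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return front


doubled_hbm_bw = dict(DEFAULTS, bw_hbm_gbs=2 * DEFAULTS["bw_hbm_gbs"])
points = [(10.0, 6.0), (8.0, 7.0), (9.0, 8.0), (12.0, 6.0)]
front = pareto_front(points)
```

With six knobs, every ratio is 1.0 at the defaults, which is where the 6.0 baseline comes from; doubling one knob raises the score by exactly 1.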
mukesh	2834383700	analytical-viz: add Auto Explore Streamlit tab New tab consumes auto_explore.run_auto_explore(). UI: - Header showing current model + workload + per-PE HBM budget - Run sweep button (spinner while ~28k configs run in ~5-10s) - 4 metric cards: enumerated, feasible, pareto count, best latency - 2-panel scatter: * latency vs PEs (feasible in grey, Pareto coloured by efficiency, dashed line connects Pareto in PE order to show the trade-off curve) * HBM utilization vs latency for Pareto, coloured by PE count - Sortable Pareto table - "Load into sidebar" widget: pick a row, sets the sidebar session_state keys (cp, tp, pp, dp, tp_placement, cp_placement, cp_ring_variant, kv_mode, ffn_scope_label) and st.rerun()s so the user can flip to another tab and see the full breakdown. Session-state caches the last sweep result under _auto_explore_result; if the model/workload/HBM change without re-running, a warning suggests refreshing. ffn_scope_label mapping is dynamic (contains substituted divisors), so the Load button reconstructs the exact label using the target cp/tp/dp values before assigning to session_state["ffn_scope_label"]. Verified: - app.py parses cleanly - All 9 pytest tests in test_auto_explore.py still pass - Smoke: matplotlib + pandas + auto_explore imports round-trip Next: verification pass across Qwen 3 8B and Mixtral 8x7B; polish. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-07-28 13:02:08 -07:00

299 Commits