Add 5 of the 6 figure renderers ADR-0057 D3 sub-cycle 4c specifies:
- gqa_op_log_{panel}.png × 4 — per-panel bar chart of the 5 op_log
counts (gemm, ipcq_send, ipcq_recv, dma_read, dma_write).
- gqa_comparison.png — cross-panel grouped bars over the same 5 series.
Sixth figure (gqa_scaling.png) depends on sub-cycle 4b's Q/cube ∈
{1, 2, 4} sweep on multi_user_* panels and is deferred until that
data exists; emit_all_gqa_plots returns just the 5 in-scope paths.
Add MILESTONE_FAST=1 mode to run(): skip the panel sweep, reuse the
committed sweep.json, render figures only. Validation mode unchanged.
The runtime errors clearly when neither env var is set, listing the
two supported modes.
Renderers live in the bench module (the milestone-1h-gemm pattern);
tests/gqa/_gqa_plot_helpers.py re-exports them for figure tests.
Tests: tests/gqa/test_plot_gqa_figures.py — 7 tests, all green:
- 4 parametrized per-panel emit assertions
- 1 comparison emit assertion
- 1 emit_all returns exactly 5 PNG paths
- 1 default out_dir matches the bench _OUTPUT_DIR
Commits the 5 PNG baselines under the bench output dir alongside
sweep.json, mirroring milestone-1h-gemm's committed-figures pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-contained eval bench (ADR-0054) that drives the four GQA Llama-70B
panels through run_bench with enable_data=True at validation scale and
emits sweep.json with the v1 schema (ADR-0057 D7).
Panel dispatch table maps each panel to (kernel, SFR install, S_q,
n_ranks, rank_axis):
single_user_prefill mesh_kv_kernel, intracube_pe_ring, S_q=16, n=8, rank_axis=0
multi_user_prefill mesh_kv_kernel, intercube_multisip, S_q=16, n=4, rank_axis=1
single_user_decode mesh_mlo_kernel, intracube_pe_ring, S_q=1, n=8, rank_axis=0
multi_user_decode mesh_mlo_kernel, intercube_multisip, S_q=1, n=4, rank_axis=1
multi_user panels pass _auto_dim_remap=False (avoid d_head=64
colliding with K's global M=64) and rank_axis=1 (cube-level ring,
gates 7 of every 8 PEs to silence).
Each panel runs on a fresh per-config GraphEngine, then op_log is
summarized into gemm/dma/ipcq counts. Both decode panels emit exactly
2*n_ranks GEMMs (one-shot partial attention per rank, ADR-0056 D3).
v1 supports GQA_VALIDATION=1 only; headline mode + figures deferred to
sub-cycles 4b/4c. Sentinel tensor satisfies the run_bench
"at least one request" contract (ADR-0045 D4 / ADR-0054 D2 carve-out).
Tests: tests/attention/test_milestone_gqa_llama70b.py — all 12 pass.
Includes committed sweep.json baseline at the bench's _OUTPUT_DIR so
subsequent test runs reuse it instead of re-simulating.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ADR-0059 single_user_* panels run the ring across PEs in one cube
(rank == tl.program_id(axis=0)). multi_user_* panels run the ring
across cubes — rank should be cube_id (axis=1), and 7 of every 8 PEs
in each cube must stay silent because the cube-level SFR install only
gives the cube-coordinate PE 0 an E/W neighbor.
Add ``rank_axis: int = 0`` kwarg to both ``attention_mesh_mlo_kernel``
and ``attention_mesh_kv_kernel``:
- 0 (default): rank == tl.program_id(axis=0). Existing single_user
behavior, all spec tests unchanged.
- 1: gate ``if tl.program_id(axis=0) != 0: return`` at kernel start,
then ``rank = tl.program_id(axis=1)``. multi_user_* panels pass
this to the kernel via ctx.launch positional arg.
Also brings in _attention_mesh_kv.py and _attention_mesh_mlo.py as
the committed home of the ADR-0059 kernels (previously living
uncommitted in the working tree from sub-cycle 4b).
Tests: 7-test rank_axis spec file (default-path + rank_axis=1 gating
and cube-id semantics, both kernels); 4-panel diag harness now green
end-to-end (single_user_prefill/decode + multi_user_prefill/decode);
763-test wider sweep clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two compounding bugs in ctx.launch's dim-translation path surfaced
by multi_user_* panels of milestone-gqa-llama70b (sub-cycle 4c step 2):
Bug A: _compute_local_shape divided by self._num_cubes (the topology's
cube count, 16 in default topology.yaml) instead of the DPPolicy's
effective num_cubes (4 for validation-scale multi_user). The tensor
allocator at context.py:471-484 already honored dp.num_cubes; the
parallel computation inside launch was out of sync. Fix mirrors the
allocator's eff_num_cubes precedence pattern.
Bug B: dim_map was keyed by value, so any scalar whose value
coincidentally equaled a global tensor dim got rewritten to that dim's
local value — e.g. d_head=64 colliding with K's global M=64 in
multi_user mode. Legacy bench kernels (va_offset etc.) rely on this
remap, so the fix is opt-out: ctx.launch(..., _auto_dim_remap=False)
preserves scalars exactly as passed. Default remains True.
Tests: 3 new dim-translation tests + 4-panel diag harness covers
single_user_* (PASS) and multi_user_* (advances to new SFR/axis layer
failure, tracked separately). va_offset + full attention spec suite
unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 cannot snapshot math-output sources at outbound send time
because math executes only in Phase 2 — so token.data stays None and
PE_DMA inbound can't write the recv slot. For own-sends this is harmless
(Phase 2 replay reads the stable scratch addr after math runs). For
forwarded sends in mesh kernels (ADR-0059), src_addr is a recv slot
that gets wrapped by later inbounds before this read's Phase 2 turn,
yielding a shape mismatch on the fallback MemoryStore.read.
Fix: DataExecutor maintains a per-slot, time-ordered, shape-keyed
history. Every ipcq_copy write appends (t_write, value) to the slot's
history; _resolve_read falls back to the most recent shape-matching
entry with t_write <= the consuming op's t_start. Applied uniformly
to _execute_memory, _execute_gemm, and _execute_math.
Secondary: OpLogger.record_end for math ops now prefers
TensorHandle.data carried by the input handle over a MemoryStore
re-read, closing the smaller record-end race covered by the new
test_op_log_input_snapshot_race.py unit tests.
Tests: 4 new race tests + 6 existing op_log + mesh decode diag +
mesh kv/mlo spec — all green. Full repo sweep: 760 passed (3
pre-existing failures unrelated: bench-registry list drift +
Windows Tkinter env).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per request, the milestone bench output is now tracked in git instead of
gitignored, so the figures/results are viewable on the remote:
- src/kernbench/benches/1H_milestone_output/gemm/ (3 PNGs + gemm_sweep.json)
- src/kernbench/benches/1H_milestone_output/ccl/ (3 per-topology PNGs,
buffer-kind PNG+CSV, FSIM comparison PNG, topology.png, summary.csv)
Drop the .gitignore rule; update ADR-0054 D3 + Negative (EN+KO) to say the
output is committed (regenerable by rerunning the bench). Artifacts produced
by full bench runs (milestone-1h-gemm non-FAST, milestone-1h-ccl).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the GEMM + allreduce sweep/render logic out of scripts/ and tests/
into two self-contained eval benches so a user can regenerate every
result + figure with one command:
kernbench run --bench milestone-1h-gemm (MILESTONE_FAST=1 reuses JSON)
kernbench run --bench milestone-1h-ccl
- benches/milestone_1h_{gemm,ccl}.py: single home for each domain; the
run(torch) entry drives the sweeps and writes figures into
benches/1H_milestone_output/{gemm,ccl}/ (gitignored), then submits a
sentinel tensor to satisfy the run_bench contract.
- tests/gemm + tests/sccl helpers and scripts/gemm_sweep.py become thin
re-export/wrapper shims over the benches (single source preserved); the
pytest-only param builders + _run_distributed wrapper stay in the shim.
- eval-bench pattern: a bench may drive many configs + build its own
per-config engines (extends ADR-0045 D5; reverses ADR-0044 D1/D2).
ADR-0054 (EN+KO) records the design; ADR-0043/0044/0045 + CLAUDE.md CLI
Semantics amended; ADR INDEX regenerated. Verified: milestone benches run
clean (ok=True, all artifacts), full suite 67 passed, lang-pairs OK.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Convert the multidevice allreduce correctness + latency/buffer-kind sweeps
to run through the real PyTorch-distributed path
(init_process_group(backend="ahbm") -> mp.spawn -> dist.all_reduce) instead
of direct ctx.launch, and reorganize the CCL/allreduce tests into a
tests/sccl/ package split one test per file.
Production change (required for the distributed path on non-square SIP grids):
- AhbmCCLBackend now reads explicit system.sips.w/h from the spec, with a
square-only sqrt fallback that raises on ambiguity, instead of silently
guessing round(sqrt(count)). This fixes the 2x3 / 3x2 torus + mesh cases,
which previously resolved to a wrong 2x2 grid. Mirrors the test helper's
_sip_topo_dims precedence (explicit w/h > square fallback > raise).
Test reorganization (tests/sccl/):
- _allreduce_helpers.py: shared plumbing (distributed driver, config writers,
direct-launch run_allreduce parity reference, sweep/buffer-kind constants,
plot aggregators, topology-diagram + FSIM-comparison emitters).
- test_allreduce_ring_torus_mesh.py: correctness across ring/torus/mesh.
- test_distributed_default_topology.py: full distributed path on topology.yaml.
- test_plot_latency_sweep.py / test_plot_buffer_kind_sweep.py: sweep rows.
- test_plot_topology_diagram.py / test_plot_comparison_fsim.py: plot emitters.
- test_intercube_root_center.py: moved in (ADR-0032 center-root latency guard).
Also:
- Move the FSIM comparison plot generator out of scripts/ into the sccl suite.
- Delete superseded test files (test_allreduce_multidevice,
test_distributed_lrab_hierarchical_allreduce, test_allreduce_buffer_kind_sweep)
and repoint conftest aggregators + the ipcq buffer-kind importers.
- Regenerate the allreduce_latency_plots derived artifacts from the full sweep.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rename the intercube all-reduce identity to lrab_hierarchical_allreduce
(module, config key, distributed test) so the name reflects both levels
it implements: LRAB intra-SIP (local reduce to center root + broadcast)
and the hierarchical inter-SIP topology exchange (ring/torus/mesh).
ADR-0032 slug kept as the stable decision id; pure rename, no logic change.
Also in this batch:
- ADR-0032 (EN+KO): document the shipped center-root bidirectional reduce
(doc was stale corner-root); annotate ccl.yaml root_cube as a placeholder.
- Rename allreduce + pe2pe latency plots to descriptive, title-matching
filenames and retitle the in-plot headings; drop overview/overview_log.
- Point the PPTX image refs at the new plot names.
Doc + derived-artifact + rename only; no simulation behavior changed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace BenchResult.engine: object | None with GraphEngine | None via
TYPE_CHECKING import (avoids circular import at runtime). Cast the
topology argument to TopologyGraph at the GraphEngine call site for
the duck-typed engine_factory. Fixes Pylance reportAttributeAccessIssue
warnings on op_log and topology arg. Type annotations only; no runtime
behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move benches/ -> src/kernbench/benches/ and src/kernbench/cli/probe.py ->
src/kernbench/probes/probe.py. Each bench self-registers via
@bench(name=..., description=...); kernbench list enumerates benches
with auto-assigned indices, --bench accepts kebab-case name or numeric
index. Audit at package-import time fails if any non-underscore module
forgets the decorator. ADR-0010 (EN + KO) updated to reflect the new
resolver path, list subcommand, and probes package separation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extend tl.composite() with an ordered epilogue list. Each op carries
a scope flag - output_tile (default, runs once per (m,n) before
STORE), k_tile (every K-tile right after GEMM), or kernel. Plan
generator slots MATH stages by scope; pe_math reuses pe_dma's
local-loop pattern so chained epilogues (bias->relu) skip the port
hop. op_log captures per-stage params for telemetry. Topology
gains a gemm->math edge (snapshot test updated).
API stays backward-compatible - `epilogue=` is opt-in.
Example:
h = tl.composite(
op="gemm", a=a, b=b, out_ptr=int(out),
epilogue=[
{"op": "dequant", "scale": s_per_k, "scope": "k_tile"},
{"op": "bias", "bias": bias_vec},
{"op": "relu"},
{"op": "scale", "factor": 0.5},
],
)
tl.wait(h)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restores per-PE HBM controller partitioning that was lost in
commit 5917b34 ("Replace xbar/bridge/single-NOC with explicit
router mesh"), which had over-consolidated the per-slice HBM CTRL
into a single cube-wide ``hbm_ctrl`` connected to every router —
the opposite of what ADR-0019 D1/D4 specifies.
Builder splits ``hbm_ctrl`` into 8 ``hbm_ctrl.pe{X}`` instances per
cube, each reachable ONLY through PE_X's attaching router via the
existing ``peX.hbm`` attach metadata from cube_mesh.yaml. Cube
aggregate BW now matches the spec (8 PEs × 8 PCs × 32 GB/s =
2048 GB/s) instead of collapsing to 256 GB/s.
AddressResolver decodes the target PE from the HBM PA's hbm_offset
(``offset // slice_size``) and returns ``hbm_ctrl.pe{X}``. PathRouter
uses the existing ``_adj_local`` adjacency for same-cube PE_DMA so
the cube's own UCIe port can no longer appear as a zero-distance
shortcut between routers — local PE_DMA now traverses the mesh,
restoring the ADR-0019 D4 worked example
``PE0.pe_dma → r0c0 → … → r1c4 → hbm_ctrl``.
Tests:
- New tests/test_per_pe_hbm_partition.py: 14 tests covering
topology shape, per-PE router exclusivity, PA resolution,
single-hop local path, cross-PE mesh traversal, and end-to-end
latency monotonicity. Probe CLI now reports
pe-local < pe-same-half < pe-cross-half (was uniform 141ns).
- Existing tests updated for new node ids and replaced two
assertions that locked in the wrong consolidation:
test_noc_mesh.test_hbm_connects_to_all_routers and
test_topology_compile.test_hbm_ctrl_connects_all_routers are
now per-PE exclusivity assertions; test_routing
.test_all_pe_hbm_equidistant becomes
test_cross_pe_hbm_distance_increases_with_mesh_hops.
- test_ipcq_buffer_kind_locations.test_hbm_pe_hop_charged_at_large_payload
threshold recalibrated 4000→1500 ns: the prior figure reflected
serialization on the over-consolidated single hbm_ctrl; per-PE
partitioning removes that artificial contention so the gap
shrinks to the genuine PE↔HBM-hop cost.
Full suite: 645 passed, 1 skipped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces global round-robin with deterministic address-derived PC
striping:
pc_shift = log2(burst_bytes)
pc_mask = num_pcs - 1
pc = (flit.address >> pc_shift) & pc_mask
Each Transaction carries base_address (HBM byte offset of the first
chunk); each Flit derives its own address as base + i*flit_bytes.
HBM CTRL routes flits to PCs via this formula, replacing the
arrival-order RR pointer. Also splits the is_last wait into an
asynchronous _finalize_txn process so the worker isn't blocked on
PC commit, exposing true PC parallelism for disjoint addresses.
phyaddr.py documents the canonical bit layout (bits [10:8] for the
default burst=256, num_pcs=8 case). ADR-0033 D6 records the
derivation and the workload scenarios where address-striping
matters (strided streams, offset-disjoint parallel transfers).
Adds tests/test_hbm_address_based_pc.py: canonical bit mapping,
strided 8-way load distribution, same-address PC-0 serialization,
PC-aligned 2KB pair collision, dynamic pc_shift from burst_bytes,
and power-of-2 attr validation. Integration tests inspect
_pc_avail ledger directly: at default config UCIe's 8 ns per-txn
overhead exactly matches chunk_time, masking PC contention at the
makespan level even though the ledger correctly distinguishes the
cases.
Full suite: 631 passed, 1 skipped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of Phase 2c-1 timing collapse identified: src.out_port and
dst.in_port aliased the same simpy.Store, so when wire chunkified a
Transaction into Flits and re-put them, fan_in could pull flits before
the wire applied bw delay — half the flits bypassed bottleneck timing.
Fix: separate Stores per directed edge. Wire is the only conduit. Each
flit on the wire incurs chunk_time = flit_nbytes/bw_gbs once, in arrival
order. Multi-hop wormhole pipelining emerges naturally because
flit-aware pass-through (TransitComponent) forwards each flit serially
without reassembly.
64 KB MemoryWrite via UCIe 128 GB/s bottleneck: 273 ns (broken) → 545 ns
(matches drain 512 + commit 8 + path overheads). 1 MB: 8230 ns (matches
drain 8192). Single-flit transfer transport-time alone, exactly what
real-HW wormhole produces.
3 pre-existing tests now off by small margins or inverted:
- test_h2d_local_cube_cut_through: 65.53 vs threshold 65.0
- test_engine_override_is_scoped_to_impl: ZeroRouter inherits
ComponentBase, not flit-aware, so override path reassembles at each
hop while default doesn't
- test_intra_sip_critical_path_at_96k_below_threshold: 96KB allreduce
microscopically over its threshold
Not weakening these to pass: they reflect model fidelity improvements
that need calibrated thresholds. To address in follow-up via test
threshold updates and ZeroRouter→TransitComponent inheritance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire decomposes Transactions into Flits per `_flit_bytes` but emits all
flits atomically at the same env.now — preserves single-msg timing as
infrastructure for Phase 2c-2 (per-flit timing + flit-aware routers).
Non-flit-aware components reassemble Flits in `_fan_in`; `_update_step`
sets txn.step to current component's path position so legacy
step-based routing continues working when upstream is flit-aware.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous model double-counted slow-upstream paths (e.g., 64KB via UCIe
128 GB/s was ~2x pessimistic). HBM CTRL now distributes bursts across
8 pseudo-channels via global round-robin, with per-chunk commit timing
that pipelines correctly against the bottleneck link's data arrival.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related issues caused measured pipeline efficiency to look
worse than the simulator's actual behavior:
1. DMA timing recorded too early. The op-log start timestamp
for a DMA op fired when the request entered the queue, and
the DMA channel was released as soon as the request was
issued. Back-to-back DMAs therefore appeared to grab the
channel simultaneously, with per-op duration drifting
upward as queue depth grew - an artifact, not real cost.
Fix: defer the start timestamp until after the channel is
acquired, and hold the channel through the full HBM
round-trip until the response returns. Per-op duration is
now constant and equal to the actual transfer interval;
serialization is visible as queue wait, not as inflated
service time.
2. Sweep timing window folded in pre-composite work. The PE
timing window spanned every PE engine record, which
included the upfront pinned-operand DMA issued before the
composite GEMM begins. For large-K shapes that one-shot
load can be nearly half of the window, conflating
operand-staging cost with composite-pipeline behavior.
Fix: add a second window scoped to the composite pipeline
by filtering op_log records to those tagged with a
tile-pipeline stage; the legacy operand-load path is
untagged and naturally excluded. For 32x3072x32 load_ref
the window drops from 1765ns to 992ns and measured eff
lines up with the steady-state DMA-bound stage limit
instead of being penalized for the one-time load.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pe2pe overview compared IPCQ (tl.send + tl.recv) against raw DMA
(tl.load + tl.store), but DMA is one-sided — DST never reads — while
tl.recv pays a slot-read on DST. The comparison was unfair: IPCQ
looked slower partly because it does more work.
Adds tl.recv_no_consume() — a separate, diagnostic-only entry point
that blocks for slot arrival but skips the slot-read (and bank-hop)
charge on DST. Production tl.recv is unchanged (no `consume` kwarg
on the public API), so the diagnostic flag can never accidentally
leak into real workloads.
Updates test_pe_to_pe_latency to call tl.recv_no_consume so the
overview.png shows IPCQ no-consume vs raw DMA on equal footing.
Also fixes PLOT_DIR back to docs/diagrams/pe2pe_latency_plots/
(was lost in a merge). Adds scripts/replot_pe2pe.py for label-only
re-renders without re-measuring.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cube SRAM and HBM live on the cube NoC behind router-attached links
(sram_to_router_bw_gbs=128, hbm_to_router_bw_gbs=256). Previously the
slot-IO model treated them as if they were per-PE local, so the
buffer_kind sweep showed TCM ≈ SRAM at 64 KB / PE.
pe_ipcq._handle_recv and pe_dma._handle_ipcq_inbound now charge a
PE→bank compute_drain_ns on top of the intrinsic slot-IO for SRAM/HBM.
TCM stays free of this hop. Adds an internal IpcqRecvCmd.consume field
that gates the recv-side hop+slot-IO charges (used by a follow-up
diagnostic API; default True keeps current behavior).
Post-fix at 64 KB / PE: TCM 12.0 µs < HBM 21.4 µs < SRAM 24.3 µs.
SRAM is slowest because its 128 GB/s bank link is the narrowest in
the system — narrower than HBM's 256 GB/s. The existing ordering test
is rewritten from tcm<sram<hbm to tcm<hbm<sram and a new
test_ipcq_buffer_kind_locations adds 3 invariants on the gap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Conflict resolution:
- intercube_allreduce.py: kept origin's `if single_cube:` early-exit
(TP launches kernel on one cube/rank → skip intra-SIP mesh and go
direct to inter-SIP exchange) AND replaced the multi-cube body with
the local center-root + bidirectional reduce/broadcast (8-hop
critical path on 4×4 vs 12 with corner root).
- tests/{allreduce,pe2pe}_latency_plots/: kept the local move to
docs/diagrams/; dropped origin's stale content edits to the old
paths (regenerable derived artifacts).
- docs/diagrams/pe2pe_latency_plots/summary.csv: kept local
(post-Phase-2 + center-root values).
Origin contributions retained as-is:
- pyproject.toml: matplotlib >= 3.7 dep.
- runtime_api/distributed.py: derive effective cube_w/h from tensor
shard placement so single-cube TP paths get cube_w=cube_h=1.
- kernel_args() now accepts optional cube_w/cube_h kwargs.
Verified post-merge:
- test_intercube_root_center.py: 2/2 (center-root multi-cube path).
- test_tp_layers.py + test_tp_mlp.py: 10/10 (single-cube TP path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the algorithmic root cube from the corner (cube_w-1,
cube_h-1) to the geometric center (cube_w//2, cube_h//2) and
have each phase converge bidirectionally so the intra-SIP
critical path drops from ~12 hops to ~8 hops on a 4×4 mesh
(left half W→E + right half E→W in row reduce; top half N→S +
bottom half S→N in col reduce; mirrored on broadcast).
Result on torus_2d 6 SIPs at 96 KB / PE on TCM:
before (corner root) : 22.0 µs
after (center root) : 17.2 µs (−22%)
Same shape on ring_1d (−7%) and mesh_2d_no_wrap (−12%); also
holds across SRAM and HBM (~−20% each).
Phase 1 test (test_intercube_root_center.py) asserts the
torus_2d 96 KB latency drops below 20.5 µs and that all 96
cubes still validate (correctness preserved).
Plot updates:
- overview.png: replace constant 10.6 µs theoretical line with
user-supplied hand-derived curve (per-cube packet count =
bytes_per_pe × 8 PEs ÷ 128 B; 1346 ns startup + 1.20 ns/pkt).
- All summary.csv numbers and per-topology PNGs regenerated.
- pe2pe_latency_plots and ipcq diagram emitter PNGs refreshed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add intra_N/S/E/W to install.py _OPPOSITE_DIR table so the intra-cube
PE-to-PE namespace is symmetrical with intercube N/S/E/W. ADR-0032
documents the intercube allreduce algorithm (supersedes ADR-0029).
Refresh ADR-0024/0025/0029 cross-refs and update
test_intercube_sfr_config.py to cover the new intra_* mappings. Drop
the obsolete test_ccl_round_robin_recv.py (replaced by intercube tests).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove rack_id (4 bits), rename sip_seg→die_id, shift fields to enable
42-bit local_offset (4 TB per die). Define PE_LOCAL/MCPU_LOCAL/CUBE_SRAM
sub-unit tables for AHBM dies and IOCPU sub-unit table for IOCHIPLET
dies (1 TB window). Supersedes ADR-0031.
Also fixes latent VA/PA confusion in pe_dma pipeline DMA path where
virtual addresses were decoded as physical addresses without MMU
translation — previously masked by coincidental bit-position alignment.
529 passed (+6 recovered), 10 pre-existing failures unchanged.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mesh_2d, torus_2d, and mesh_2d_no_wrap accept optional w,h kwargs;
sqrt fall-back preserved for square layouts (back-compat tests
confirm 4-SIP and 9-SIP square configs still work). sfr_config
reads system.sips.w/h from spec and threads dims through to the
topology fn.
test_allreduce_multidevice CONFIGS switched from 4 SIPs (square)
to 6 SIPs: ring_1d_6sip, torus_2d_6sip_2x3, mesh_2d_no_wrap_6sip_2x3.
_write_temp_configs writes system.sips.w/h when supplied;
_sip_topo_dims reads them back. Latency sweep loop also moved to
6-SIP layouts. Linear-scale plot variants dropped -- only log-scale
*.png + summary.csv emitted. Plots in tests/allreduce_latency_plots
regenerated.
New tests/test_sip_topology_rectangular.py asserts neighbor
correctness for 2x3 layouts and back-compat for square fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The single-walk predictor (find_node_path(io_cpu, pe_cpu) +
compute_path_latency_ns) under-shot actual dispatch latency for far
cubes -- the routing graph could pick a path bypassing M_CPU, and
non-zero-nbytes launch sub-txns serialized on shared first hops.
Far PEs arrived at _execute_kernel after target_start_ns, silently
skipped the barrier yield, and started pe_exec_start late. Their
reported pe_exec_ns under-counted by exactly the late_ns amount
(63 ns observed at h4 cube4.pe0 in the IPCQ test, up to 113 ns
worst case for cubes 9-11), producing the suspicious flat region
in the h4 IPCQ curve at 8192/10240 bytes.
Fix:
- IO_CPU predictor uses the explicit two-leg chain
(IO_CPU->M_CPU + M_CPU->PE_CPU - io.overhead - m.overhead), so
every PE on every targeted cube has a barrier >= its real
dispatch arrival.
- Kernel-launch fanout sub-txns carry nbytes=0 (control-plane,
not data-plane), removing the per-cube fanout serialization
that pushed far M_CPUs past the predictor.
- Legacy io_cpu mirror updated.
ADR-0009 D5 mechanism updated to specify the two-leg formula and
the nbytes=0 requirement. New tests/test_d5_barrier_invariant.py
asserts (a) no PE enters _execute_kernel after target_start_ns and
(b) every PE in a multi-cube launch has identical pe_exec_start --
both regressions silently pass on the existing
tests/test_kernel_launch_sync.py because that test only inspects
post-aggregation max(pe_exec_ns).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PE_IPCQ._handle_recv now yields-from _delayed_credit_send instead of
spawning it as a fork, so the receiver's pe_exec_ns includes the
credit-return cost. _credit_latency_ns switches from
compute_drain_ns(path, 16) to compute_path_latency_ns(path, 16) and
fixes a latent find_path bug where the destination lacked the
".pe_dma" suffix (silently returned 0 ns under the bare except).
Net effect on h3/h4 inter-cube pe-to-pe latency: IPCQ >= raw DMA at
every size, matching real-HW posted-write semantics. tl.send remains
fire-and-forget. ADR-0023 D9 amended; new diagnostic test
tests/test_pe_to_pe_diagnostic.py captures per-PE pe_exec_ns, paths,
drain, and meta-arrival timing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- KernelLaunchMsg gains target_start_ns: IO_CPU stamps a global barrier
(max path latency across every target PE), M_CPU passes it through,
PE_CPU yields until it before recording pe_exec_start. Every PE in a
launch begins kernel execution at the same env.now regardless of its
dispatch path length — eliminates per-PE dispatch-offset artifact in
cross-PE and cross-cube latency measurements.
- PE_DMA._handle_ipcq_inbound now pays Transaction.drain_ns at the top,
matching the terminal-drain behavior of ComponentBase._forward_txn for
every non-IPCQ Transaction. SRC-side tl.send stays fire-and-forget
(sender doesn't yield on sub_done); tl.recv now blocks until bytes
have actually drained into its inbox.
- ComponentContext: new compute_path_latency_ns helper + node_overhead_ns
field populated by GraphEngine.
- tests/test_kernel_launch_sync.py: asserts all PEs in one launch
produce identical pe_exec_ns for a no-op kernel (zero spread).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds tests/test_pe_to_pe_latency.py: a sweep that measures PE-to-PE
transfer latency for five hop types (intra-cube horizontal/vertical,
inter-cube horizontal/vertical, inter-SIP) across data sizes 128 B to
10 KB, on both the IPCQ (tl.send/tl.recv) and raw-DMA (tl.load+tl.store)
paths. Emits per-hop PNG plots, an overview PNG, and a CSV summary into
tests/pe2pe_latency_plots/. Latency is reported as max(pe_exec_ns) across
participating PEs, read from engine.get_completion(), so the measurement
captures the SRC/DST PE's kernel body time rather than the full launch+
response-aggregation envelope.
Two simulator fixes were needed to make this measurement meaningful:
- PeMMU now stores a list of (start, end, pa) sub-regions per page
rather than a single PA. DPPolicy layouts with shards smaller than
page_size (e.g. 128 B payloads with 4 KB pages) used to silently
overwrite each other through last-write-wins, causing DMAs intended
for cube0 to physically route to cube3 - inflating latency by ~170 ns
per DMA at small sizes. STOPGAP: real MMUs don't support sub-page
regions; long-term fix is either smaller MMU page size or DPPolicy
validation that refuses sub-page shards.
- M_CPU's per-PE metrics aggregation (pe_exec_ns, dma_ns, compute_ns)
now max-merges against the existing value in result_data rather than
overwriting. Multi-cube workloads share one result_data dict via
IO_CPU fanout; the previous overwrite caused whichever cube's M_CPU
finished last to clobber others' values, so multi-cube pe_exec_ns was
racy and frequently 0. Same fix applied in legacy/builtin/m_cpu.py.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements ADR-0027 Phase 2 end-to-end. All 559 tests pass (was 523 +
1 xfail; ring_default_ws strict-xfail is now resolved).
D0 — Worker-wait generalization (context.py):
- _pending_worker_waits queue on RuntimeContext.
- ctx.wait(h) in worker context defers to main via g.parent.switch().
Fast-path for already-completed handles.
- Worker API is unchanged: tensor deploy, launch, etc. still look
synchronous; they're transparently cooperatively scheduled.
- Solves ADR-0024 Phase B kernel-greenlet orphan bug (env.run now
only ever drives from main; kernel _parent is always main).
D0.5 — Host-read barrier (tensor.py):
- Explicit _HOST_READ_BARRIERS registry (T5.g closed-set via code
review, not reflection-magic).
- numpy/data/__getitem__/__repr__ drain pending worker-waits before
host-observable read.
- copy_: source-side barrier via source.numpy(). Target-side write
barrier is intentionally NOT applied — global pending target barrier
prematurely drains cross-rank collectives → deadlock.
- Collective pending is excluded from barrier drain condition
(collective is cross-rank; its own yield in all_reduce covers the
invariant naturally).
D1 — torch.multiprocessing.spawn (runtime_api/multiprocessing.py):
- API signature parity with real PyTorch spawn; execution is
cooperative greenlet scheduler (process isolation etc. are explicit
non-goals per D1.0).
- _drain_pending drains worker-waits then collectives in one barrier,
loop-until-empty.
- Round-based exception handling with SystemExit sibling abort +
SpawnException(errors) wrapping root-cause ranks.
- RuntimeContext attaches ctx.multiprocessing in __post_init__.
- benches/ccl_allreduce.py hand-rolled loop collapses to one
torch.multiprocessing.spawn call.
D2–D6 — kernbench.tp package:
- parallel_state: initialize_model_parallel, get_*_rank,
get_*_world_size, with weak active-ctx registry in context.py.
- layers: ColumnParallelLinear, RowParallelLinear (shape-only
primitives — fp16 gemm via tl.load + tl.dot + tl.store).
- kernels: _gemm_kernel used by TP layers (self-contained; no bench
dependency).
- primitives / mappings stubs per D6/D8.
Data-path fixes (surfaced by TP gemm + all_reduce sequence):
- sim_engine/op_log.py: dma_write snapshot is skipped for TCM
sources (PE scratch is repopulated by Phase 2 math/gemm replay —
capturing Phase-1-time snapshot picked up STALE data from prior
kernel's output aliased at the same scratch addr, causing the later
kernel's dma_write to overwrite Phase 2 result with stale value).
- sim_engine/op_log.py + sim_engine/data_executor.py: per-operand
space recorded on GemmCmd and composite gemm records so HBM-resident
operands (tl.load output) don't default to TCM during replay.
- runtime_api/context.py: ctx.zeros writes zero-init to MemoryStore
at VA keys so kernels reading via VA see deterministic init even
without explicit copy_().
Tests (Phase 1 + Phase 2):
- test_worker_wait_drain (T3): orphan invariant + resume + multi-rank
drain + idempotency + exception propagation.
- test_mp_spawn (T4): spawn shape + bind + SpawnException scope.
- test_host_read_barrier (T5): barrier contract per entry-point +
closed-set registry check.
- test_tp_parallel_state (T1): initialize + rank lookup.
- test_tp_layers (T2): shape + deterministic numerical correctness
(concat-matmul equality for RowParallel, not mean-only).
- test_tp_mlp (T6): full 2-layer MLP with deterministic weight
numerical match + rank-consistency post all-reduce.
- test_ccl_allreduce_matrix: ring_default_ws xfail removed (T7).
Regression: 523 pre + 35 new + 1 ex-xfail = 559 passed, 1 intentional
skip (T3.e historical failure documentation).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause (hang diagnosis):
`kernel_runner.run()` captures `greenlet.getcurrent()` at spawn time as
the kernel greenlet's `_parent`. When a worker greenlet (say g0) calls
`dist.all_reduce` → `ctx.wait(h)` → `env.run(until=h0)`, the SimPy
scheduler steps pe_cpu processes, which in turn spawn kernel greenlets.
Those kernels' `_parent` becomes g0 (current greenlet at spawn). When a
kernel yields via switch_to_simpy, control jumps back up to g0's LAST
switch point — which is the main scheduler's `g.switch()` call — rather
than the kernel_runner's generator frame. Main then re-enters its
`for g in alive: g.switch()` loop mid-wait, producing nested greenlet
re-entry. Scheduler spins: g0 never completes, g1 appears to complete
out of order, infinite loop at 100% CPU.
Fix:
- AhbmCCLBackend.all_reduce: in multi-greenlet mode, submit via
launch(_defer_wait=True), extend backend._pending_collective_handles,
and yield to the parent greenlet. Worker does NOT call wait.
- benches/ccl_allreduce.py run(): after each scheduler round, the MAIN
greenlet drains backend._pending_collective_handles. This keeps
env.run invocation in the main context, so kernel_runner's spawned
kernel greenlets have main as their _parent — no nested re-entry.
- Legacy single-driver path (no bench scheduler): all_reduce falls back
to inline wait when g.parent is None.
Result:
- Multi-greenlet cross-SIP ring no longer hangs (was 100% CPU infinite
loop in kernel_runner._switch_kernel).
- ring_default_ws still xfail(strict=True): now fails as a data
correctness issue — DataExecutor reports only 1 math op for a 2-rank
ring (expected 2). Cross-SIP op_log replay integration is the
remaining Phase B task.
514 passed, 1 xfailed (strict).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2-rank bidirectional ring deadlock: when E and W neighbors point to the
same peer, sender-coord matching in _handle_meta_arrival / _credit_worker
picked the first direction in dict order, landing data in the wrong rx
slot relative to what the kernel recv(W) was waiting on.
Fix (ADR-0025 D1/D2/D3):
- install.reverse_direction: prefer OPPOSITE direction (E↔W, N↔S) when
peer has it pointing back to us; fallback to any matching for
topologies without opposite convention (tree_binary parent/child).
- _handle_meta_arrival: match by token.dst_addr range against each qp's
my_rx_base_pa + n_slots × slot_size window (unambiguous).
- _credit_worker: match by credit.dst_rx_base_pa == qp.peer.rx_base_pa.
- IpcqCreditMetadata: new dst_rx_base_pa field carrying receiver-side
rx base; _delayed_credit_send fills it from the consuming qp.
Tests (Phase 1 → Phase 2):
- test_reverse_direction_opposite_preference_2rank_ring
- test_reverse_direction_opposite_preference_4rank_ring_sanity
- test_meta_arrival_matches_by_dst_addr_same_peer
- test_credit_matches_by_dst_rx_base_pa_same_peer
- Existing credit-return test updated with dst_rx_base_pa.
508 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tensor.__setitem__ / __getitem__:
- Shard-aligned slice assignment and read on deployed tensors.
- Scalar broadcast and numpy array assignment supported.
- Cross-shard slices raise NotImplementedError (use copy_ for that).
- 3 new tests: single-PE, multi-PE, cross-shard error case.
Hierarchical all-reduce kernel (src/kernbench/ccl/algorithms/):
- 3-level reduce: intra-cube (E/W) → inter-cube (N/S) → inter-SIP (parent).
- Bidirectional ring reduce at each level: ceil((N-1)/2) rounds.
Left half sends via dir_dec, right half via dir_inc (wrap).
Representative receives from both sides.
- Chain broadcast for reverse path: cube 0 PE 0 → all PE 0s → all PEs.
- Registered in ccl.yaml as "hierarchical_allreduce" with topology: none
(neighbors() override builds the full 3-level neighbor map).
- kernel_args derives pes_per_cube/cubes_per_sip/num_sips from world_size.
- Mock-verified at 8/16/32/64/128 ranks.
Mock runtime fixes:
- Direction pairing: explicit N↔S, E↔W, parent↔parent instead of
"first matching reverse". Fixes 2-element rings where N and S both
point to the same peer.
- Deadlock detection: send-counter based (not just queue-depth-total)
to catch chain reductions where send+recv pairs net to zero.
- Multi-cube program_id: pes_per_cube parameter enables
program_id(axis=0) = PE within cube, program_id(axis=1) = cube id.
Legacy single-cube tests unaffected (default = world_size).
504 tests pass in 12s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: In ring all-reduce, PE_IPCQ's recv handler advances my_tail
and issues a credit return immediately. With tight credit latency
(0.12ns intra-cube), the sender can refill the slot BEFORE the
receiver's outbound PE_DMA reads from it for the next send. The
outbound snapshot then captures stale data from a later round.
Fix: Propagate TensorHandle.data (captured at recv-time, before credit
return) through the entire send chain:
tl.send(src=handle) → IpcqSendCmd.data → IpcqDmaToken.data
PE_DMA outbound already prefers token.data over MemoryStore read, so
the recv-time snapshot is used for the in-flight data. This eliminates
the race: the snapshot is captured before the slot can be overwritten.
Additional fixes:
- PE_MATH handle_command: compute SIMD latency from output tensor
element count via _compute_ns(), using max(overhead_ns, compute_ns).
Previously used overhead_ns=0.0 for all standalone MathCmd, making
math ops take 0ns in SimPy.
- DataExecutor secondary sort: same-t_start ops sorted by op_kind
(memory < gemm < math) so IPCQ slot writes execute before math reads.
- ipcq_copy recorded at INBOUND time (receiver PE_DMA arrival) instead
of outbound. Inbound time is after fabric propagation, so it sorts
correctly relative to the receiver's math.
- record_copy accepts explicit snapshot parameter (from token.data).
Result: N_ELEM=32 + 256-rank + n_slots=4 + cross-SIP now passes.
n_slots reverted to 4 (the deeper buffer was a workaround, not needed).
502 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test matrix restructure:
- 256-rank full-system ring runs only ONCE (marked pytest.mark.slow)
instead of 7× across matrix + perf tests. Cross-SIP routing is
verified by the single run; buffer variants (tcm/hbm/sram) are
tested at 8-rank where they finish in <0.5s.
- Performance tests use 8-rank instead of 256-rank.
- `pytest -m "not slow"` completes in ~2.5min (local dev).
- Full suite including slow: ~6min (CI).
DataExecutor optimization:
- Remove ThreadPoolExecutor from DataExecutor.run(). Same-t_start
groups are almost always size 1, so the thread pool creation and
dispatch overhead dominated. Simple sequential loop is faster.
- Skip dma_read ops at the loop level (they are always no-ops in
Phase 2 but were dispatched through _execute_op → _execute_memory).
- Remove redundant CLI Phase 2 re-execution: engine._flush_data_phase
already replays during engine.wait(); the CLI now only prints the
diagnostic summary without re-running DataExecutor.
502 tests pass. Wall time: 25m30s → 5m43s (full), 2m28s (no slow).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tl.program_id(axis=0) returns local PE id within cube,
tl.program_id(axis=1) returns cube id. Enables cube-aware
sharding in benchmark kernels.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- CLI: --verify-data flag enables Phase 2 data verification (ADR-0020)
- Tensor.data: returns actual numpy values (verify-data) or zeros placeholder
- Tensor.__repr__: shows value summary or data=N/A (placeholder)
- DataExecutor: ThreadPoolExecutor for same-timestamp parallel op execution
- BenchResult.engine: exposes op_log/memory_store for Phase 2 access
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>