Move benches/ -> src/kernbench/benches/ and src/kernbench/cli/probe.py ->
src/kernbench/probes/probe.py. Each bench self-registers via
@bench(name=..., description=...); kernbench list enumerates benches
with auto-assigned indices, --bench accepts kebab-case name or numeric
index. Audit at package-import time fails if any non-underscore module
forgets the decorator. ADR-0010 (EN + KO) updated to reflect the new
resolver path, list subcommand, and probes package separation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the bilingual-structure commit: docs/adr-ko/ now holds
only Korean versions (24 files translated from English placeholders),
ADR-0013 slug uses kebab-case in both folders, and the verify tool
allows translated parenthetical commentary in the Status block.
- Translate 24 English files in docs/adr-ko/ to Korean. The previous
bilingual-structure commit had left these as English copies because
their source content was already English; this commit fulfills the
policy that docs/adr-ko/ contains only Korean.
- Rename ADR-0013 in both adr/ and adr-ko/ from
ver-verification_strategy.md to ver-verification-strategy.md
(kebab-case consistency with other ADRs).
- CLAUDE.md (ADR Translation Discipline): clarify that only the
Status lifecycle keyword (Accepted / Proposed / Stub / Draft /
Superseded by ADR-NNNN / Merged into ADR-NNNN) must match across
EN and KO; parenthetical commentary and trailing list items may be
translated.
- tools/verify_adr_lang_pairs.py: replace byte-equal Status check
with normalize_status_keyword() which strips parenthetical
commentary and takes only the first non-empty line.
- tests/test_verify_adr_lang_pairs.py: update existing test names,
add coverage for translated parenthetical, translated trailing
list, and Superseded-by-NNNN keyword equality.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Establish English as the canonical ADR language with Korean translations
held in a parallel docs/adr-ko/ tree as derived artifacts (1:1 mirror).
Promotion from adr-proposed/ to adr/ now writes English to adr/ and the
Korean to adr-ko/; bidirectional sync rule documented in CLAUDE.md.
- Migrate 30 ADRs in docs/adr/: 28 Korean-only translated to English,
2 bilingual pairs (ADR-0020, ADR-0023) consolidated (.en.md suffix
dropped). ADR-0023 EN regenerated against KO source which had newer
HW Realization Notes (D16-D23) section.
- docs/adr-history/ left frozen by design (transitional state).
- CLAUDE.md (Part 2): update ADR Lifecycle for 4-folder layout, mark
docs/adr-ko/ as a Derived Artifact, add ADR Translation Discipline
section covering bidirectional sync, conflict resolution (EN wins),
and proposed-language freedom.
- tools/verify_adr_lang_pairs.py: new verification tool checking pair
completeness, filename mirroring, ADR-ID match, Status byte-equality.
Pre-commit hook intentionally not added; run on demand or in CI.
- tests/test_verify_adr_lang_pairs.py: 11 cases including CRLF/LF
normalization, em-dash title separator, underscore-slug edge case.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- CLAUDE.md: add ADR Lifecycle subsection (superseded → docs/history/,
immutable numbering, no renumber)
- ADR-0011: merge ADR-0018 content as "Address Model: LA" section
alongside PA / VA; status notes VA model is currently implemented
- ADR-0018 / 0029 / 0031: moved to docs/history/ with status updates
(0018 merged into 0011, 0029 superseded by 0032, 0031 absorbed
into 0001 rev 2)
- ADR-0019: rewrite Context as PE-HBM connectivity decision
(self-contained, no LA model framing)
- ADR-0019/0020/0021/0023/0025/0027: Status Proposed → Accepted
(code verified) and prune Implementation Notes / Affected files /
Test strategy / "현재 상태" sub-sections describing pre-impl state
- ADR-0024/0026: same migration-flavor cleanup; 0026 also drops D6
Migration and D8 docs-update sub-decisions
- ADR-0030: status simplified (blocker ADR-0031 now superseded)
- SPEC.md: R10 + §0.2 reflect PA / VA / LA model names
- ADR-0008/0012/0013: refresh ADR-0011 subtitle in Links
21 files changed, 553 insertions(+), 1290 deletions(-).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extend tl.composite() with an ordered epilogue list. Each op carries
a scope flag - output_tile (default, runs once per (m,n) before
STORE), k_tile (every K-tile right after GEMM), or kernel. Plan
generator slots MATH stages by scope; pe_math reuses pe_dma's
local-loop pattern so chained epilogues (bias->relu) skip the port
hop. op_log captures per-stage params for telemetry. Topology
gains a gemm->math edge (snapshot test updated).
API stays backward-compatible - `epilogue=` is opt-in.
Example:
h = tl.composite(
op="gemm", a=a, b=b, out_ptr=int(out),
epilogue=[
{"op": "dequant", "scale": s_per_k, "scope": "k_tile"},
{"op": "bias", "bias": bias_vec},
{"op": "relu"},
{"op": "scale", "factor": 0.5},
],
)
tl.wait(h)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User asked to surface system-wide congestion (more accurate than
single-cube), bring back the latency-breakdown plot under a separate
filename, and rename the obscure ``streaming`` category.
Scenarios:
Renamed all_pe_to_pe0 → all_pe_cube0_to_pe0 (clarify cube scope).
Added two SIP-wide scenarios:
sip_local_all — every PE in sip0 (128 total) accesses its own
local slice. All paths disjoint (each PE owns
its own hbm_ctrl.peX), so the model should
scale linearly with cube count.
sip_hotspot_pe0 — every PE in sip0 (128 total) targets
sip0.cube0.pe0_slice. Worst-case hotspot:
UCIe inbound + r0c0→hbm_ctrl.pe0 saturated.
Each bar now carries an ``N=...`` annotation showing the issuer
count, and the chart titles say the scope explicitly.
Effective BW + util at 16 KB:
sip_local_all N=128 eff= 27.2 TB/s util_a= 83 %
sip_hotspot_pe0 N=128 eff= 134 GB/s util_a= 93 %
(UCIe-into-cube0 saturated)
Plots:
no_congestion.png + congestion.png — Effective BW utilization
(two bars: single vs aggregate peak)
breakdown_no_congestion.png +
breakdown_congestion.png — stacked latency breakdown
(renamed from previous)
summary.csv with columns for both views.
The visual y-cap on BW utilization is 150 %. Bars exceeding it (e.g.
sip_local_all's util_single = 10,639 %) are drawn at the cap with an
upward arrow and the real value annotated. The verification rule for
``util_single`` is loosened to ``≤ n_issuers × 100 % + 5 %`` so
massively-parallel disjoint scenarios pass.
Category renamed: ``streaming`` → ``wire_transfer``. It is the
bulk-transfer time = (n_flits − 1) × flit_bytes / bottleneck_bw — the
cost of streaming the rest of the payload through the slowest wire
after the first flit has arrived.
All checks PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each scenario now shows TWO bars:
util_single = effective_bw / single-path peak × 100
(peak = min bw_gbs on first issuer's path)
util_aggregate = effective_bw / aggregate-resource peak × 100
(peak = max-min fair share across concurrent paths)
Aggregate peak uses a max-min fair-share computation: each concurrent
path's sustainable share on an edge is bw_gbs / usage_count, the
per-path throughput is the min share along its edges, and the aggregate
peak is the sum across paths. This produces the correct answer for both
shared-bottleneck scenarios (N paths converge on one wire → aggregate =
wire BW) and multi-lane shared resources (UCIe's 4 connections used in
parallel → aggregate ≈ 4 × per-conn BW), without enumerating max-flow.
Single-issuer (no_congestion) → util_single == util_aggregate by
definition. Congestion exposes the divergence:
ctrl_hot_{1,2,3}, all_pe_to_pe0 → both metrics agree (one shared
bottleneck: r0c0→hbm_ctrl.pe0 @ 256 GB/s)
8×PE eastbound → util_single=106 % (single conn @ 128 GB/s) but
util_aggregate=85 % (UCIe-W.conn0 @ 7-way shared,
aggregate peak ≈ 160 GB/s under the current
cross-cube routing that funnels via cube1.r0c0).
Verification updated to assert:
(2) util_aggregate ≤ 100 % (effective BW can't exceed the aggregate
resource peak, by construction).
(3) single-issuer util_single == util_aggregate.
(7) ucie_eastbound: util_aggregate is meaningfully smaller than
util_single (the multi-lane peak correction is observable).
CSV grows with peak_aggregate_bw_gbs and util_aggregate_pct columns;
breakdown columns retained.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the latency-breakdown stacked bars with a single utilization
bar per scenario. Each bar shows ``effective_bw / peak_bottleneck_bw``
with both values annotated, and a horizontal "single-path peak" line at
100 %. The colour band (green ≥70 %, amber ≥40 %, red <40 %) makes the
no-congestion distance roll-off scannable at a glance.
Definitions:
effective_bw = (total bytes transferred) / wall-clock time
no_congestion: nbytes / total_ns
congestion: n_issuers × nbytes / makespan_ns (aggregate)
peak_bw = min(edge.bw_gbs) on first issuer's path
util_pct = effective_bw / peak_bw × 100
The congestion graph shows that 8×PE eastbound exceeds 100 % of a
single-path peak (106.4 %): UCIe-N's 4 connections × 128 GB/s give
512 GB/s of aggregate eastbound capacity, so concurrent issuers across
disjoint conns sum past any single conn's 128 GB/s. The 8×PE→pe0_slice
hotspot reaches 91.7 %, almost saturating the shared r0c0→hbm_ctrl.pe0
bottleneck — the simulator's address-based PC striping + per-flit
arbitration model amortises the cost cleanly.
Self-verification updated to BW invariants:
(1) effective BW shrinks as topological distance grows
(2) util_pct ∈ (0, 250 %]
(3) single-issuer util_pct ≤ 100 %
(4) effective_bw = nbytes / total_ns for single requests
(5) congestion aggregate BW grows monotonically with issuer count
on the hot-target series
(6) 8-PE all-hit-pe0 saturates ≥ 70 % of shared peak
All checks PASS at the current model.
The CSV retains all breakdown components (pe_setup, noc_mesh, ucie,
fabric, streaming, hbm_ctrl, contention) so a future replot can still
recover the latency-breakdown view without re-running the simulator.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/plot_pe_dma_perf.py runs the simulator across six
no-congestion scenarios (SAME_CUBE_PE_LOCAL / REMOTE_BEST /
REMOTE_WORST, REMOTE_CUBE_BEST / REMOTE_WORST, REMOTE_SIP) and
five congestion scenarios (1/2/3 PE hot-target, 8-PE corresp.
cube-to-cube, 8-PE all-hit-pe0). It categorises actual total /
makespan into pe_setup, noc_mesh, ucie, fabric, streaming,
hbm_ctrl, and a contention residual using a wormhole-pipelined
model (first-flit arrival + (n_flits-1)/bottleneck + final
chunk_time).
Outputs:
docs/diagrams/pe_dma_perf/no_congestion.png — single-PE latency
by topological distance. Visualises monotonic growth from
SAME_CUBE_PE_LOCAL (77 ns) up to REMOTE_CUBE_PE_REMOTE_WORST
(573 ns) and REMOTE_SIP (409 ns).
docs/diagrams/pe_dma_perf/congestion.png — makespan as concurrent
issuer count grows. ctrl_hot_{1,2,3}=82/158/230 ns; 8-PE
eastbound UCIe = 963 ns; 8-PE all-hit-pe0 = 558 ns.
docs/diagrams/pe_dma_perf/summary.csv — raw rows for re-plotting.
Built-in --verify harness asserts:
(1) distance monotonicity for no-congestion;
(2) same-cube paths contain zero UCIe budget;
(3) remote-cube/SIP paths carry positive UCIe budget;
(4) breakdown is internally consistent (formula ≤ actual);
(5) streaming term matches (n_flits-1) × flit_bytes /
bottleneck_bw within 5 % for the local scenario;
(6) congestion makespan is monotonic in issuer count;
(7) 8-PE hotspot strictly exceeds 3-PE hotspot.
Cross-SIP gets a looser 70 % contention slack because the path
crosses two non-flit-aware (pcie_ep) boundaries that force
store-and-forward re-streaming the simple formula does not
attribute. Single-cube scenarios stay under 25 % residual.
All checks PASS at the current model (post ADR-0019 D1/D4
per-PE HBM CTRL restoration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restores per-PE HBM controller partitioning that was lost in
commit 5917b34 ("Replace xbar/bridge/single-NOC with explicit
router mesh"), which had over-consolidated the per-slice HBM CTRL
into a single cube-wide ``hbm_ctrl`` connected to every router —
the opposite of what ADR-0019 D1/D4 specifies.
Builder splits ``hbm_ctrl`` into 8 ``hbm_ctrl.pe{X}`` instances per
cube, each reachable ONLY through PE_X's attaching router via the
existing ``peX.hbm`` attach metadata from cube_mesh.yaml. Cube
aggregate BW now matches the spec (8 PEs × 8 PCs × 32 GB/s =
2048 GB/s) instead of collapsing to 256 GB/s.
AddressResolver decodes the target PE from the HBM PA's hbm_offset
(``offset // slice_size``) and returns ``hbm_ctrl.pe{X}``. PathRouter
uses the existing ``_adj_local`` adjacency for same-cube PE_DMA so
the cube's own UCIe port can no longer appear as a zero-distance
shortcut between routers — local PE_DMA now traverses the mesh,
restoring the ADR-0019 D4 worked example
``PE0.pe_dma → r0c0 → … → r1c4 → hbm_ctrl``.
Tests:
- New tests/test_per_pe_hbm_partition.py: 14 tests covering
topology shape, per-PE router exclusivity, PA resolution,
single-hop local path, cross-PE mesh traversal, and end-to-end
latency monotonicity. Probe CLI now reports
pe-local < pe-same-half < pe-cross-half (was uniform 141ns).
- Existing tests updated for new node ids and replaced two
assertions that locked in the wrong consolidation:
test_noc_mesh.test_hbm_connects_to_all_routers and
test_topology_compile.test_hbm_ctrl_connects_all_routers are
now per-PE exclusivity assertions; test_routing
.test_all_pe_hbm_equidistant becomes
test_cross_pe_hbm_distance_increases_with_mesh_hops.
- test_ipcq_buffer_kind_locations.test_hbm_pe_hop_charged_at_large_payload
threshold recalibrated 4000→1500 ns: the prior figure reflected
serialization on the over-consolidated single hbm_ctrl; per-PE
partitioning removes that artificial contention so the gap
shrinks to the genuine PE↔HBM-hop cost.
Full suite: 645 passed, 1 skipped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces global round-robin with deterministic address-derived PC
striping:
pc_shift = log2(burst_bytes)
pc_mask = num_pcs - 1
pc = (flit.address >> pc_shift) & pc_mask
Each Transaction carries base_address (HBM byte offset of the first
chunk); each Flit derives its own address as base + i*flit_bytes.
HBM CTRL routes flits to PCs via this formula, replacing the
arrival-order RR pointer. Also splits the is_last wait into an
asynchronous _finalize_txn process so the worker isn't blocked on
PC commit, exposing true PC parallelism for disjoint addresses.
phyaddr.py documents the canonical bit layout (bits [10:8] for the
default burst=256, num_pcs=8 case). ADR-0033 D6 records the
derivation and the workload scenarios where address-striping
matters (strided streams, offset-disjoint parallel transfers).
Adds tests/test_hbm_address_based_pc.py: canonical bit mapping,
strided 8-way load distribution, same-address PC-0 serialization,
PC-aligned 2KB pair collision, dynamic pc_shift from burst_bytes,
and power-of-2 attr validation. Integration tests inspect
_pc_avail ledger directly: at default config UCIe's 8 ns per-txn
overhead exactly matches chunk_time, masking PC contention at the
makespan level even though the ledger correctly distinguishes the
cases.
Full suite: 631 passed, 1 skipped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Allreduce + pe2pe + ipcq + pe_view auto-regenerated by test sweeps
running against the new chunk-streaming wire timing (per-flit
wormhole) — absolute numbers shift upward to reflect bottleneck-link
transit charged once per flit (instead of the previous cut-through
subtraction at HBM CTRL).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Standalone summary of the modeled hardware hierarchy and components.
Cross-references ADR-0003, 0004, 0014, 0017, 0022.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NOC topology is an implementation choice (mesh, ring, crossbar, etc.).
ADR-0017 covers the current 2D mesh choice; ADRs at the system-level
shouldn't bind to that specific implementation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cycle-accurate arbitration policies (priority/iSLIP) downgraded to
"academic / specific use cases" — FIFO inbox is approximately fair
for typical similar-rate workloads (GEMM, AllReduce, data parallel).
True impact appears only for QoS modeling or per-stream tail latency
analysis under saturation.
Higher-priority items pulled forward: address-based PC selection at
HBM CTRL (directly affects multi-PE concurrent HBM contention), bank
conflict modeling, HBM scheduler, finite buffer backpressure, op_log
chunk-streaming integration.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Earlier the future-work list mentioned "multi-flow fair sharing on a
single shared link" which was confusing — each wire has a single
source, so this isn't a real gap. The actual modeling story:
- Multi-stream merging at routers IS handled via per-in_port fan_in +
shared inbox + FIFO worker forwarding. Flits from different
upstream streams interleave at flit granularity naturally.
- What's NOT modeled: cycle-accurate arbitration policies (priority,
iSLIP), address-based PC selection at HBM CTRL (round-robin is
address-blind, so size-aligned concurrent transactions hit full
PC contention even when real-HW address striping would diverge),
sub-flit (32B) granularity, finite buffer backpressure, and bank
conflict modeling.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- test_op_log_per_transaction_not_per_flit (renamed from
..._records...): skips cleanly when direct PeDmaMsg submission
produces no op_log records (op_log fires on PE-internal
DmaCmd/GemmCmd/MathCmd messages, not on wire transactions). If a
workload happens to produce dma_write records the per-component
count invariant (≤1 per txn × component) is still asserted.
- ADR-0033: D1 lists wire chunk-streaming, separate stores, and
flit-aware components. D2/D3/D4 updated for new wire model.
D6 future work notes op_log full integration with chunk-streaming.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous model double-counted slow-upstream paths (e.g., 64KB via UCIe
128 GB/s was ~2x pessimistic). HBM CTRL now distributes bursts across
8 pseudo-channels via global round-robin, with per-chunk commit timing
that pipelines correctly against the bottleneck link's data arrival.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related issues caused measured pipeline efficiency to look
worse than the simulator's actual behavior:
1. DMA timing recorded too early. The op-log start timestamp
for a DMA op fired when the request entered the queue, and
the DMA channel was released as soon as the request was
issued. Back-to-back DMAs therefore appeared to grab the
channel simultaneously, with per-op duration drifting
upward as queue depth grew - an artifact, not real cost.
Fix: defer the start timestamp until after the channel is
acquired, and hold the channel through the full HBM
round-trip until the response returns. Per-op duration is
now constant and equal to the actual transfer interval;
serialization is visible as queue wait, not as inflated
service time.
2. Sweep timing window folded in pre-composite work. The PE
timing window spanned every PE engine record, which
included the upfront pinned-operand DMA issued before the
composite GEMM begins. For large-K shapes that one-shot
load can be nearly half of the window, conflating
operand-staging cost with composite-pipeline behavior.
Fix: add a second window scoped to the composite pipeline
by filtering op_log records to those tagged with a
tile-pipeline stage; the legacy operand-load path is
untagged and naturally excluded. For 32x3072x32 load_ref
the window drops from 1765ns to 992ns and measured eff
lines up with the steady-state DMA-bound stage limit
instead of being penalized for the one-time load.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/build_overview_slides.py renders a 5-slide PPTX
(kernbench2_overview.pptx) summarizing architecture, model
correctness, IPCQ, allreduce, and buffer-kind tier comparison.
scripts/emit_overview_with_external_ref.py renders log-y and
broken-y variants of the allreduce overview (overview_log.png,
overview_broken.png) including a 366 µs ext-sim reference marker
at 96 KB / PE.
Also includes cube_mesh_view.png rendered from the SVG.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pe2pe overview compared IPCQ (tl.send + tl.recv) against raw DMA
(tl.load + tl.store), but DMA is one-sided — DST never reads — while
tl.recv pays a slot-read on DST. The comparison was unfair: IPCQ
looked slower partly because it does more work.
Adds tl.recv_no_consume() — a separate, diagnostic-only entry point
that blocks for slot arrival but skips the slot-read (and bank-hop)
charge on DST. Production tl.recv is unchanged (no `consume` kwarg
on the public API), so the diagnostic flag can never accidentally
leak into real workloads.
Updates test_pe_to_pe_latency to call tl.recv_no_consume so the
overview.png shows IPCQ no-consume vs raw DMA on equal footing.
Also fixes PLOT_DIR back to docs/diagrams/pe2pe_latency_plots/
(was lost in a merge). Adds scripts/replot_pe2pe.py for label-only
re-renders without re-measuring.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cube SRAM and HBM live on the cube NoC behind router-attached links
(sram_to_router_bw_gbs=128, hbm_to_router_bw_gbs=256). Previously the
slot-IO model treated them as if they were per-PE local, so the
buffer_kind sweep showed TCM ≈ SRAM at 64 KB / PE.
pe_ipcq._handle_recv and pe_dma._handle_ipcq_inbound now charge a
PE→bank compute_drain_ns on top of the intrinsic slot-IO for SRAM/HBM.
TCM stays free of this hop. Adds an internal IpcqRecvCmd.consume field
that gates the recv-side hop+slot-IO charges (used by a follow-up
diagnostic API; default True keeps current behavior).
Post-fix at 64 KB / PE: TCM 12.0 µs < HBM 21.4 µs < SRAM 24.3 µs.
SRAM is slowest because its 128 GB/s bank link is the narrowest in
the system — narrower than HBM's 256 GB/s. The existing ordering test
is rewritten from tcm<sram<hbm to tcm<hbm<sram and a new
test_ipcq_buffer_kind_locations adds 3 invariants on the gap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Conflict resolution:
- intercube_allreduce.py: kept origin's `if single_cube:` early-exit
(TP launches kernel on one cube/rank → skip intra-SIP mesh and go
direct to inter-SIP exchange) AND replaced the multi-cube body with
the local center-root + bidirectional reduce/broadcast (8-hop
critical path on 4×4 vs 12 with corner root).
- tests/{allreduce,pe2pe}_latency_plots/: kept the local move to
docs/diagrams/; dropped origin's stale content edits to the old
paths (regenerable derived artifacts).
- docs/diagrams/pe2pe_latency_plots/summary.csv: kept local
(post-Phase-2 + center-root values).
Origin contributions retained as-is:
- pyproject.toml: matplotlib >= 3.7 dep.
- runtime_api/distributed.py: derive effective cube_w/h from tensor
shard placement so single-cube TP paths get cube_w=cube_h=1.
- kernel_args() now accepts optional cube_w/cube_h kwargs.
Verified post-merge:
- test_intercube_root_center.py: 2/2 (center-root multi-cube path).
- test_tp_layers.py + test_tp_mlp.py: 10/10 (single-cube TP path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the algorithmic root cube from the corner (cube_w-1,
cube_h-1) to the geometric center (cube_w//2, cube_h//2) and
have each phase converge bidirectionally so the intra-SIP
critical path drops from ~12 hops to ~8 hops on a 4×4 mesh
(left half W→E + right half E→W in row reduce; top half N→S +
bottom half S→N in col reduce; mirrored on broadcast).
Result on torus_2d 6 SIPs at 96 KB / PE on TCM:
before (corner root) : 22.0 µs
after (center root) : 17.2 µs (−22%)
Same shape on ring_1d (−7%) and mesh_2d_no_wrap (−12%); also
holds across SRAM and HBM (~−20% each).
Phase 1 test (test_intercube_root_center.py) asserts the
torus_2d 96 KB latency drops below 20.5 µs and that all 96
cubes still validate (correctness preserved).
Plot updates:
- overview.png: replace constant 10.6 µs theoretical line with
user-supplied hand-derived curve (per-cube packet count =
bytes_per_pe × 8 PEs ÷ 128 B; 1346 ns startup + 1.20 ns/pkt).
- All summary.csv numbers and per-topology PNGs regenerated.
- pe2pe_latency_plots and ipcq diagram emitter PNGs refreshed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plot output dirs now live under docs/diagrams/ (the canonical
"derived artifacts" location per CLAUDE.md):
tests/allreduce_latency_plots/ → docs/diagrams/allreduce_latency_plots/
tests/pe2pe_latency_plots/ → docs/diagrams/pe2pe_latency_plots/
+ new docs/diagrams/ipcq_diagram_plots/ with two presentation diagrams
(ipcq_send_recv.png, ipcq_two_pe_dma.png)
New test tests/test_emit_ipcq_diagram.py renders the two IPCQ
diagrams from a static description (no simulation); it exists so
the diagrams can be regenerated reproducibly.
Path references updated in tests/test_pe_to_pe_latency.py.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add intra_N/S/E/W to install.py _OPPOSITE_DIR table so the intra-cube
PE-to-PE namespace is symmetrical with intercube N/S/E/W. ADR-0032
documents the intercube allreduce algorithm (supersedes ADR-0029).
Refresh ADR-0024/0025/0029 cross-refs and update
test_intercube_sfr_config.py to cover the new intra_* mappings. Drop
the obsolete test_ccl_round_robin_recv.py (replaced by intercube tests).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove rack_id (4 bits), rename sip_seg→die_id, shift fields to enable
42-bit local_offset (4 TB per die). Define PE_LOCAL/MCPU_LOCAL/CUBE_SRAM
sub-unit tables for AHBM dies and IOCPU sub-unit table for IOCHIPLET
dies (1 TB window). Supersedes ADR-0031.
Also fixes latent VA/PA confusion in pe_dma pipeline DMA path where
virtual addresses were decoded as physical addresses without MMU
translation — previously masked by coincidental bit-position alignment.
529 passed (+6 recovered), 10 pre-existing failures unchanged.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The single-walk predictor (find_node_path(io_cpu, pe_cpu) +
compute_path_latency_ns) under-shot actual dispatch latency for far
cubes -- the routing graph could pick a path bypassing M_CPU, and
non-zero-nbytes launch sub-txns serialized on shared first hops.
Far PEs arrived at _execute_kernel after target_start_ns, silently
skipped the barrier yield, and started pe_exec_start late. Their
reported pe_exec_ns under-counted by exactly the late_ns amount
(63 ns observed at h4 cube4.pe0 in the IPCQ test, up to 113 ns
worst case for cubes 9-11), producing the suspicious flat region
in the h4 IPCQ curve at 8192/10240 bytes.
Fix:
- IO_CPU predictor uses the explicit two-leg chain
(IO_CPU->M_CPU + M_CPU->PE_CPU - io.overhead - m.overhead), so
every PE on every targeted cube has a barrier >= its real
dispatch arrival.
- Kernel-launch fanout sub-txns carry nbytes=0 (control-plane,
not data-plane), removing the per-cube fanout serialization
that pushed far M_CPUs past the predictor.
- Legacy io_cpu mirror updated.
ADR-0009 D5 mechanism updated to specify the two-leg formula and
the nbytes=0 requirement. New tests/test_d5_barrier_invariant.py
asserts (a) no PE enters _execute_kernel after target_start_ns and
(b) every PE in a multi-cube launch has identical pe_exec_start --
both regressions silently pass on the existing
tests/test_kernel_launch_sync.py because that test only inspects
post-aggregation max(pe_exec_ns).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PE_IPCQ._handle_recv now yields-from _delayed_credit_send instead of
spawning it as a fork, so the receiver's pe_exec_ns includes the
credit-return cost. _credit_latency_ns switches from
compute_drain_ns(path, 16) to compute_path_latency_ns(path, 16) and
fixes a latent find_path bug where the destination lacked the
".pe_dma" suffix (silently returned 0 ns under the bare except).
Net effect on h3/h4 inter-cube pe-to-pe latency: IPCQ >= raw DMA at
every size, matching real-HW posted-write semantics. tl.send remains
fire-and-forget. ADR-0023 D9 amended; new diagnostic test
tests/test_pe_to_pe_diagnostic.py captures per-PE pe_exec_ns, paths,
drain, and meta-arrival timing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- KernelLaunchMsg gains target_start_ns: IO_CPU stamps a global barrier
(max path latency across every target PE), M_CPU passes it through,
PE_CPU yields until it before recording pe_exec_start. Every PE in a
launch begins kernel execution at the same env.now regardless of its
dispatch path length — eliminates per-PE dispatch-offset artifact in
cross-PE and cross-cube latency measurements.
- PE_DMA._handle_ipcq_inbound now pays Transaction.drain_ns at the top,
matching the terminal-drain behavior of ComponentBase._forward_txn for
every non-IPCQ Transaction. SRC-side tl.send stays fire-and-forget
(sender doesn't yield on sub_done); tl.recv now blocks until bytes
have actually drained into its inbox.
- ComponentContext: new compute_path_latency_ns helper + node_overhead_ns
field populated by GraphEngine.
- tests/test_kernel_launch_sync.py: asserts all PEs in one launch
produce identical pe_exec_ns for a no-op kernel (zero spread).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tl.program_id(axis=0) returns local PE id within cube,
tl.program_id(axis=1) returns cube id. Enables cube-aware
sharding in benchmark kernels.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Design for actual data storage/computation in HBM/TCM/SRAM components:
- Phase 1: SimPy timing + MemoryStore (memory ops data-aware via greenlet)
- Phase 2: op_log-based numpy execution for GEMM/Math verification
- Greenlet-based KernelRunner replaces Phase 0 command list generation
- tl.load() returns real data in Phase 1, enabling memory-based control flow
- ComponentBase hook for op logging (single source of truth)
- MemoryStore: numpy ndarray tensor-granular storage with reference semantics
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Component placement uses mm coordinates in topology.yaml, mesh_gen
finds the nearest router automatically. M_CPU moved to pos_mm=[7.5,2.0]
(→ r0c2), SRAM at pos_mm=[1.5,9.0] (→ r3c0).
No hardcoded router references in topology config.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stub length increased to 12px (PE/HBM) and 10px (UCIe).
Gap between router and component increased to 30px so both
45° stubs (router end + component end) are clearly visible.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- M_CPU placed north (above) its router
- All connectors: 45° stub from router → straight → 45° stub to component
- Consistent 4-point polyline pattern for PE, M_CPU, SRAM, HBM, UCIe
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
UCIe position calculated with minimal inset (0.3 × size) to
place components flush against cube boundary edges.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All connectors now start with 45° diagonal from router edge,
then go straight (vertical/horizontal) to the component block.
Applies to PE, M_CPU/SRAM, PE→HBM, and UCIe connectors.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Router-router mesh links remain straight (horizontal/vertical).
All component→router connectors use 45° L-bend polylines:
- PE blocks: vertical then 45° diagonal to router
- M_CPU/SRAM: horizontal then 45° diagonal to router
- PE→HBM port group: vertical then 45° diagonal
- UCIe port→router: direction-aware 45° bend
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
UCIe components placed at defined positions from _cube_local_positions
with size from cube.geometry.ucie_mm.size. N/S horizontal, E/W vertical.
Connection ports rendered as color-coded boxes inside UCIe component.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>