Add 5 of the 6 figure renderers ADR-0057 D3 sub-cycle 4c specifies:
- gqa_op_log_{panel}.png × 4 — per-panel bar chart of the 5 op_log
counts (gemm, ipcq_send, ipcq_recv, dma_read, dma_write).
- gqa_comparison.png — cross-panel grouped bars over the same 5 series.
Sixth figure (gqa_scaling.png) depends on sub-cycle 4b's Q/cube ∈
{1, 2, 4} sweep on multi_user_* panels and is deferred until that
data exists; emit_all_gqa_plots returns just the 5 in-scope paths.
Add MILESTONE_FAST=1 mode to run(): skip the panel sweep, reuse the
committed sweep.json, render figures only. Validation mode unchanged.
The runtime errors clearly when neither env var is set, listing the
two supported modes.
Renderers live in the bench module (the milestone-1h-gemm pattern);
tests/gqa/_gqa_plot_helpers.py re-exports them for figure tests.
Tests: tests/gqa/test_plot_gqa_figures.py — 7 tests, all green:
- 4 parametrized per-panel emit assertions
- 1 comparison emit assertion
- 1 emit_all returns exactly 5 PNG paths
- 1 default out_dir matches the bench _OUTPUT_DIR
Commits the 5 PNG baselines under the bench output dir alongside
sweep.json, mirroring milestone-1h-gemm's committed-figures pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-contained eval bench (ADR-0054) that drives the four GQA Llama-70B
panels through run_bench with enable_data=True at validation scale and
emits sweep.json with the v1 schema (ADR-0057 D7).
Panel dispatch table maps each panel to (kernel, SFR install, S_q,
n_ranks, rank_axis):
single_user_prefill mesh_kv_kernel, intracube_pe_ring, S_q=16, n=8, rank_axis=0
multi_user_prefill mesh_kv_kernel, intercube_multisip, S_q=16, n=4, rank_axis=1
single_user_decode mesh_mlo_kernel, intracube_pe_ring, S_q=1, n=8, rank_axis=0
multi_user_decode mesh_mlo_kernel, intercube_multisip, S_q=1, n=4, rank_axis=1
multi_user panels pass _auto_dim_remap=False (avoid d_head=64
colliding with K's global M=64) and rank_axis=1 (cube-level ring,
gates 7 of every 8 PEs to silence).
Each panel runs on a fresh per-config GraphEngine, then op_log is
summarized into gemm/dma/ipcq counts. Both decode panels emit exactly
2*n_ranks GEMMs (one-shot partial attention per rank, ADR-0056 D3).
v1 supports GQA_VALIDATION=1 only; headline mode + figures deferred to
sub-cycles 4b/4c. Sentinel tensor satisfies the run_bench
"at least one request" contract (ADR-0045 D4 / ADR-0054 D2 carve-out).
Tests: tests/attention/test_milestone_gqa_llama70b.py — all 12 pass.
Includes committed sweep.json baseline at the bench's _OUTPUT_DIR so
subsequent test runs reuse it instead of re-simulating.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ADR-0059 single_user_* panels run the ring across PEs in one cube
(rank == tl.program_id(axis=0)). multi_user_* panels run the ring
across cubes — rank should be cube_id (axis=1), and 7 of every 8 PEs
in each cube must stay silent because the cube-level SFR install only
gives the cube-coordinate PE 0 an E/W neighbor.
Add ``rank_axis: int = 0`` kwarg to both ``attention_mesh_mlo_kernel``
and ``attention_mesh_kv_kernel``:
- 0 (default): rank == tl.program_id(axis=0). Existing single_user
behavior, all spec tests unchanged.
- 1: gate ``if tl.program_id(axis=0) != 0: return`` at kernel start,
then ``rank = tl.program_id(axis=1)``. multi_user_* panels pass
this to the kernel via ctx.launch positional arg.
Also brings in _attention_mesh_kv.py and _attention_mesh_mlo.py as
the committed home of the ADR-0059 kernels (previously living
uncommitted in the working tree from sub-cycle 4b).
Tests: 7-test rank_axis spec file (default-path + rank_axis=1 gating
and cube-id semantics, both kernels); 4-panel diag harness now green
end-to-end (single_user_prefill/decode + multi_user_prefill/decode);
763-test wider sweep clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two compounding bugs in ctx.launch's dim-translation path surfaced
by multi_user_* panels of milestone-gqa-llama70b (sub-cycle 4c step 2):
Bug A: _compute_local_shape divided by self._num_cubes (the topology's
cube count, 16 in default topology.yaml) instead of the DPPolicy's
effective num_cubes (4 for validation-scale multi_user). The tensor
allocator at context.py:471-484 already honored dp.num_cubes; the
parallel computation inside launch was out of sync. Fix mirrors the
allocator's eff_num_cubes precedence pattern.
Bug B: dim_map was keyed by value, so any scalar whose value
coincidentally equaled a global tensor dim got rewritten to that dim's
local value — e.g. d_head=64 colliding with K's global M=64 in
multi_user mode. Legacy bench kernels (va_offset etc.) rely on this
remap, so the fix is opt-out: ctx.launch(..., _auto_dim_remap=False)
preserves scalars exactly as passed. Default remains True.
Tests: 3 new dim-translation tests + 4-panel diag harness covers
single_user_* (PASS) and multi_user_* (advances to new SFR/axis layer
failure, tracked separately). va_offset + full attention spec suite
unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 cannot snapshot math-output sources at outbound send time
because math executes only in Phase 2 — so token.data stays None and
PE_DMA inbound can't write the recv slot. For own-sends this is harmless
(Phase 2 replay reads the stable scratch addr after math runs). For
forwarded sends in mesh kernels (ADR-0059), src_addr is a recv slot
that gets wrapped by later inbounds before this read's Phase 2 turn,
yielding a shape mismatch on the fallback MemoryStore.read.
Fix: DataExecutor maintains a per-slot, time-ordered, shape-keyed
history. Every ipcq_copy write appends (t_write, value) to the slot's
history; _resolve_read falls back to the most recent shape-matching
entry with t_write <= the consuming op's t_start. Applied uniformly
to _execute_memory, _execute_gemm, and _execute_math.
Secondary: OpLogger.record_end for math ops now prefers
TensorHandle.data carried by the input handle over a MemoryStore
re-read, closing the smaller record-end race covered by the new
test_op_log_input_snapshot_race.py unit tests.
Tests: 4 new race tests + 6 existing op_log + mesh decode diag +
mesh kv/mlo spec — all green. Full repo sweep: 760 passed (3
pre-existing failures unrelated: bench-registry list drift +
Windows Tkinter env).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per request, the milestone bench output is now tracked in git instead of
gitignored, so the figures/results are viewable on the remote:
- src/kernbench/benches/1H_milestone_output/gemm/ (3 PNGs + gemm_sweep.json)
- src/kernbench/benches/1H_milestone_output/ccl/ (3 per-topology PNGs,
buffer-kind PNG+CSV, FSIM comparison PNG, topology.png, summary.csv)
Drop the .gitignore rule; update ADR-0054 D3 + Negative (EN+KO) to say the
output is committed (regenerable by rerunning the bench). Artifacts produced
by full bench runs (milestone-1h-gemm non-FAST, milestone-1h-ccl).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the GEMM + allreduce sweep/render logic out of scripts/ and tests/
into two self-contained eval benches so a user can regenerate every
result + figure with one command:
kernbench run --bench milestone-1h-gemm (MILESTONE_FAST=1 reuses JSON)
kernbench run --bench milestone-1h-ccl
- benches/milestone_1h_{gemm,ccl}.py: single home for each domain; the
run(torch) entry drives the sweeps and writes figures into
benches/1H_milestone_output/{gemm,ccl}/ (gitignored), then submits a
sentinel tensor to satisfy the run_bench contract.
- tests/gemm + tests/sccl helpers and scripts/gemm_sweep.py become thin
re-export/wrapper shims over the benches (single source preserved); the
pytest-only param builders + _run_distributed wrapper stay in the shim.
- eval-bench pattern: a bench may drive many configs + build its own
per-config engines (extends ADR-0045 D5; reverses ADR-0044 D1/D2).
ADR-0054 (EN+KO) records the design; ADR-0043/0044/0045 + CLAUDE.md CLI
Semantics amended; ADR INDEX regenerated. Verified: milestone benches run
clean (ok=True, all artifacts), full suite 67 passed, lang-pairs OK.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a section-based table of contents for the 46-ADR corpus, mirroring
the /report skill's classification (Design Principles / High-level
Architecture / Detailed Architecture by component / Implementation
Decisions by topic). Generated for both docs/adr/ (EN titles) and
docs/adr-ko/ (KO titles) from one tool.
tools/generate_adr_index.py:
- Single CLASSIFICATION dict per ADR — add an entry when introducing a
new ADR; the script fails loud if any file is missing from the table.
- DETAILED_COMPONENTS lists each builtin component and the ADR(s) that
cover it (ADR-0014 appears under six PE engines; ADR-0023 under
pe_dma + pe_ipcq).
- Accepts both ":" and "—" title separators (matching ADR-0033's
existing format).
- --check mode for CI: exits 1 if INDEX.md is stale.
Also includes the docs/report/architecture-2026-1H.md generated by the
prior /report write (the public-facing architecture document; 836 lines,
76 source-attribution comments).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents four cross-cutting surfaces one layer deeper than the prior
G4 batch:
- 0050 par-ccl-algorithm-module-contract: how to author a new CCL
algorithm in src/kernbench/ccl/algorithms/. Pairs with ADR-0045's
bench-module contract. Pins the four required public symbols
(kernel, kernel_args, TOPO_NAME_TO_KIND constants, kernel alias),
the 9 + tl standardized kernel signature, the kernel_args tuple
format, sip_topo_kind dispatch, and the ccl.yaml entry workflow.
- 0051 lat-routing-helper-api: every public method of AddressResolver
(resolve, find_m_cpu, find_pcie_ep, find_io_cpu, find_all_pcie_eps)
and PathRouter (find_path, find_path_with_distance,
find_mcpu_dma_path, find_memory_path, find_node_path + 2 shims).
Pins the four adjacency graphs (_adj_all / _adj / _adj_mcpu_dma /
_adj_local) and the edge-kind exclusion sets they use, plus the
single-owner naming convention.
- 0052 dev-oplog-memory-store-schemas: OpRecord's 7 fields, the
per-op_name params matrix (dma_read, dma_write, gemm_*, math, math
reduction, composite_gemm, ipcq_copy, unknown), snapshot timing
rules (math = all inputs, dma_write = HBM-only — ADR-0027 race
avoidance), TileToken stage_type capture, and MemoryStore's
(space, addr) two-level dict with reference-store semantics.
- 0053 dev-topology-builder-algorithms: the 6-stage compile pipeline,
cube_mesh.yaml's source_hash cache and its 5 input fields, the
cube NoC auto-layout algorithm (row/col placement, HBM exclusion
zone, PE/M_CPU/SRAM attachment via nearest-router, UCIe N/S/E/W
distribution), the node naming convention (single-owner with
router.py), the edge-kind catalog, the 4 view projections, and a
table of spec-field changes vs mesh regeneration.
Bilingual pair verifier passes for all four EN/KO pairs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents four cross-cutting surfaces that previously had no ADR backing,
each surfaced as a G4 candidate by /report:
- 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates
all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...),
the two execution modes (command-list vs greenlet runner), scratch
allocator semantics, dispatch-overhead model, and the kernel registry.
- 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group
(backend="ahbm") install path. world_size priority (algorithm >
defaults > topology), the 4-step init sequence (load ccl.yaml, import
algorithm module, derive world_size, install SFR + IPCQ), greenlet-
local rank registry, all_reduce dispatch via _defer_wait, barrier
no-op rationale, and the explicit list of unsupported dist.* APIs.
- 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator
free-list semantics. Offset-keyed first-fit with coalescing, the
no-validation trust model for free(), HBM/TCM channel separation,
page-aligned VA allocation, the page_size dual-default
(VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and
one-allocator-per-sub-unit rule.
- 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog.
H2D / D2H / PE DMA categories with their exact cube-index choices,
the 32 KiB reference size, the 5-point utilization sweep, the
formula vs actual column meanings, automatic invariant checks
(monotonicity, D2H >= H2D, best < worst), per-case GraphEngine
isolation, and the human-readable (not machine-parsable) output
contract.
Bilingual pair verifier passes for all four EN/KO pairs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents src/kernbench/benches/: how @bench registration + audit work,
how the CLI dispatches via run_bench/RuntimeContext, and the contract a
new bench module must satisfy.
Nine decisions (D1-D9) cover:
- @bench name/description rules and duplicate detection
- Module-file convention (_-prefixed helpers vs bench modules)
- def run(torch) signature; torch = RuntimeContext
- Minimum-one-submit rule (else NO_REQUESTS)
- Single-device convention + multi-SIP CCL exception (ADR-0024/0027)
- resolve() name/index decision tree; indices are not a stable API
- Exact RuntimeContext surface exposed to benches
- Env-var parameterization (matmul_composite / gemm_sweep.py pattern)
Four alternatives rejected with documented reasons (manifest YAML,
decorator entry= arg, @multi_device_bench split, stable indices).
Verifier (tools/verify_adr_lang_pairs.py) passes for EN/KO pair.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Document the allreduce + GEMM evaluation harnesses and bring the affected
allreduce ADRs in line with the refactored code.
New (Accepted, EN + KO):
- ADR-0043 — allreduce evaluation harness (tests/sccl/): distributed-driven
correctness, latency/buffer-kind sweeps, sessionfinish plot aggregators,
topology + FSIM-comparison figures. Verified against the implementation.
- ADR-0044 — GEMM evaluation harness (scripts/gemm_sweep.py + tests/gemm/):
heavy-script data gen vs. fast test-rendered figures, slow regenerator,
the 3-figure set. Records two limitations as open questions: the
theoretical-model constants are inherited (not yet traced to ADR-0033/
0014), and the *_measured figure is a naming misnomer.
Updated (EN + KO):
- ADR-0024 — add D5: SIP grid w/h resolution (explicit sips.w/h, square
fallback, fail-loud), documenting the AhbmCCLBackend fix.
- ADR-0032 — D4/D5/Non-goals reconciled: rectangular SIP grids (e.g. 6 SIPs
as 3x2) are supported via explicit w/h; the square requirement now
applies only to the fallback. Affected-files repointed to tests/sccl/.
Verification: ADR-0023 and ADR-0042 confirmed still matching the code (no
change). verify_adr_lang_pairs.py passes (EN/KO Status blocks byte-equal).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror the sccl pattern for GEMM figures: a tests/gemm/ package renders the
GEMM bar charts as PNGs from the committed docs/diagrams/gemm_sweep.json, so
the figures are fast test artifacts (run by default) while the heavy sim sweep
stays a manual script (scripts/gemm_sweep.py, kept) wrapped by a slow
regenerator test.
tests/gemm/:
- _gemm_plot_helpers.py: matplotlib renderers (series logic mirrors the
GEMM _render_* functions in scripts/build_overview_slides.py).
- test_plot_gemm_stage_breakdown.py: gemm_stage_breakdown.png (load_ref).
- test_plot_gemm_mac_utilization.py: gemm_mac_utilization_measured.png +
gemm_mac_utilization_theoretical_vs_measured.png (load_ref).
- test_gemm_sweep.py: @pytest.mark.slow regenerator (runs scripts/gemm_sweep.py).
Chart set trimmed to three (stage breakdown, MAC util, theoretical-vs-measured);
"formula" relabeled to "theoretical" throughout the comparison chart.
Known follow-ups (not blocking):
- gemm_mac_utilization_measured.png currently plots the theoretical ideal-
pipeline model, not simulator-measured data; the name is a misnomer pending
a decision to repoint its content or retitle.
- The theoretical-model constants (HBM 256 GB/s, T_stage 16 ns, 3 stages) are
inherited verbatim from build_overview_slides.py and not yet verified against
ADR-0033 / ADR-0014 / topology.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Convert the multidevice allreduce correctness + latency/buffer-kind sweeps
to run through the real PyTorch-distributed path
(init_process_group(backend="ahbm") -> mp.spawn -> dist.all_reduce) instead
of direct ctx.launch, and reorganize the CCL/allreduce tests into a
tests/sccl/ package split one test per file.
Production change (required for the distributed path on non-square SIP grids):
- AhbmCCLBackend now reads explicit system.sips.w/h from the spec, with a
square-only sqrt fallback that raises on ambiguity, instead of silently
guessing round(sqrt(count)). This fixes the 2x3 / 3x2 torus + mesh cases,
which previously resolved to a wrong 2x2 grid. Mirrors the test helper's
_sip_topo_dims precedence (explicit w/h > square fallback > raise).
Test reorganization (tests/sccl/):
- _allreduce_helpers.py: shared plumbing (distributed driver, config writers,
direct-launch run_allreduce parity reference, sweep/buffer-kind constants,
plot aggregators, topology-diagram + FSIM-comparison emitters).
- test_allreduce_ring_torus_mesh.py: correctness across ring/torus/mesh.
- test_distributed_default_topology.py: full distributed path on topology.yaml.
- test_plot_latency_sweep.py / test_plot_buffer_kind_sweep.py: sweep rows.
- test_plot_topology_diagram.py / test_plot_comparison_fsim.py: plot emitters.
- test_intercube_root_center.py: moved in (ADR-0032 center-root latency guard).
Also:
- Move the FSIM comparison plot generator out of scripts/ into the sccl suite.
- Delete superseded test files (test_allreduce_multidevice,
test_distributed_lrab_hierarchical_allreduce, test_allreduce_buffer_kind_sweep)
and repoint conftest aggregators + the ipcq buffer-kind importers.
- Regenerate the allreduce_latency_plots derived artifacts from the full sweep.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rename the intercube all-reduce identity to lrab_hierarchical_allreduce
(module, config key, distributed test) so the name reflects both levels
it implements: LRAB intra-SIP (local reduce to center root + broadcast)
and the hierarchical inter-SIP topology exchange (ring/torus/mesh).
ADR-0032 slug kept as the stable decision id; pure rename, no logic change.
Also in this batch:
- ADR-0032 (EN+KO): document the shipped center-root bidirectional reduce
(doc was stale corner-root); annotate ccl.yaml root_cube as a placeholder.
- Rename allreduce + pe2pe latency plots to descriptive, title-matching
filenames and retitle the in-plot headings; drop overview/overview_log.
- Point the PPTX image refs at the new plot names.
Doc + derived-artifact + rename only; no simulation behavior changed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace BenchResult.engine: object | None with GraphEngine | None via
TYPE_CHECKING import (avoids circular import at runtime). Cast the
topology argument to TopologyGraph at the GraphEngine call site for
the duck-typed engine_factory. Fixes Pylance reportAttributeAccessIssue
warnings on op_log and topology arg. Type annotations only; no runtime
behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fill component-model coverage gaps surfaced by /report's G4 analysis.
Each ADR documents the component's First action, latency model, and
honest notes on dormant code or implementation asymmetries discovered
during re-evaluation against current code.
- 0038 pcie_ep: thin protocol-overhead model; ComponentBase forwarding
worker as-is; named-node contract for router helpers
- 0039 pe_mmu: component + utility dual role; sub-page region stopgap;
D2.1 flags pipeline path missing mmu.overhead_ns timeout (asymmetric
with non-pipeline; not visible at default tlb_overhead_ns=0)
- 0040 pe_tcm: dual-channel BW serialization (read/write Resource cap=1);
TcmRequest schema owned by TCM; timing-only (no data store)
- 0041 sram: terminal scratchpad model + ResponseMsg on reverse path;
D1.1 flags _worker override as currently dormant (no Transaction
actually targets the SRAM node today)
- 0042 tiling: pure plan-generator module, not a component; corrects
the G4 misclassification; pins GEMM/Math stage sequences and
epilogue scope contract
Also: /report skill G3 refinement — only flag older->newer asymmetric
cross-references; newer->older (e.g., 0034-0037 citing infrastructure
ADRs) are expected one-way and no longer reported.
Bilingual pair verifier (tools/verify_adr_lang_pairs.py) passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move benches/ -> src/kernbench/benches/ and src/kernbench/cli/probe.py ->
src/kernbench/probes/probe.py. Each bench self-registers via
@bench(name=..., description=...); kernbench list enumerates benches
with auto-assigned indices, --bench accepts kebab-case name or numeric
index. Audit at package-import time fails if any non-underscore module
forgets the decorator. ADR-0010 (EN + KO) updated to reflect the new
resolver path, list subcommand, and probes package separation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the bilingual-structure commit: docs/adr-ko/ now holds
only Korean versions (24 files translated from English placeholders),
ADR-0013 slug uses kebab-case in both folders, and the verify tool
allows translated parenthetical commentary in the Status block.
- Translate 24 English files in docs/adr-ko/ to Korean. The previous
bilingual-structure commit had left these as English copies because
their source content was already English; this commit fulfills the
policy that docs/adr-ko/ contains only Korean.
- Rename ADR-0013 in both adr/ and adr-ko/ from
ver-verification_strategy.md to ver-verification-strategy.md
(kebab-case consistency with other ADRs).
- CLAUDE.md (ADR Translation Discipline): clarify that only the
Status lifecycle keyword (Accepted / Proposed / Stub / Draft /
Superseded by ADR-NNNN / Merged into ADR-NNNN) must match across
EN and KO; parenthetical commentary and trailing list items may be
translated.
- tools/verify_adr_lang_pairs.py: replace byte-equal Status check
with normalize_status_keyword() which strips parenthetical
commentary and takes only the first non-empty line.
- tests/test_verify_adr_lang_pairs.py: update existing test names,
add coverage for translated parenthetical, translated trailing
list, and Superseded-by-NNNN keyword equality.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Establish English as the canonical ADR language with Korean translations
held in a parallel docs/adr-ko/ tree as derived artifacts (1:1 mirror).
Promotion from adr-proposed/ to adr/ now writes English to adr/ and the
Korean to adr-ko/; bidirectional sync rule documented in CLAUDE.md.
- Migrate 30 ADRs in docs/adr/: 28 Korean-only translated to English,
2 bilingual pairs (ADR-0020, ADR-0023) consolidated (.en.md suffix
dropped). ADR-0023 EN regenerated against KO source which had newer
HW Realization Notes (D16-D23) section.
- docs/adr-history/ left frozen by design (transitional state).
- CLAUDE.md (Part 2): update ADR Lifecycle for 4-folder layout, mark
docs/adr-ko/ as a Derived Artifact, add ADR Translation Discipline
section covering bidirectional sync, conflict resolution (EN wins),
and proposed-language freedom.
- tools/verify_adr_lang_pairs.py: new verification tool checking pair
completeness, filename mirroring, ADR-ID match, Status byte-equality.
Pre-commit hook intentionally not added; run on demand or in CI.
- tests/test_verify_adr_lang_pairs.py: 11 cases including CRLF/LF
normalization, em-dash title separator, underscore-slug edge case.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- CLAUDE.md: add ADR Lifecycle subsection (superseded → docs/history/,
immutable numbering, no renumber)
- ADR-0011: merge ADR-0018 content as "Address Model: LA" section
alongside PA / VA; status notes VA model is currently implemented
- ADR-0018 / 0029 / 0031: moved to docs/history/ with status updates
(0018 merged into 0011, 0029 superseded by 0032, 0031 absorbed
into 0001 rev 2)
- ADR-0019: rewrite Context as PE-HBM connectivity decision
(self-contained, no LA model framing)
- ADR-0019/0020/0021/0023/0025/0027: Status Proposed → Accepted
(code verified) and prune Implementation Notes / Affected files /
Test strategy / "현재 상태" sub-sections describing pre-impl state
- ADR-0024/0026: same migration-flavor cleanup; 0026 also drops D6
Migration and D8 docs-update sub-decisions
- ADR-0030: status simplified (blocker ADR-0031 now superseded)
- SPEC.md: R10 + §0.2 reflect PA / VA / LA model names
- ADR-0008/0012/0013: refresh ADR-0011 subtitle in Links
21 files changed, 553 insertions(+), 1290 deletions(-).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Reorganize rules into reusable general behavior vs kernbench-specific
foundations + rules
- Add Surfacing Choices, Coding Style (Simplicity First, Surgical Changes),
Mental Model, Common Failure Modes
- Clarify Phase 1 forbidden vs permitted-for-discussion (pseudocode,
sketches allowed; final ready-to-apply diffs are Phase 2 only)
- Tighten dead-code handling: mention + options before deletion
- Drop redundant "SPEC.md and ADRs are the final authority" from
Enforcement Defaults (already in Authority & Scope)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extend tl.composite() with an ordered epilogue list. Each op carries
a scope flag - output_tile (default, runs once per (m,n) before
STORE), k_tile (every K-tile right after GEMM), or kernel. Plan
generator slots MATH stages by scope; pe_math reuses pe_dma's
local-loop pattern so chained epilogues (bias->relu) skip the port
hop. op_log captures per-stage params for telemetry. Topology
gains a gemm->math edge (snapshot test updated).
API stays backward-compatible - `epilogue=` is opt-in.
Example:
h = tl.composite(
op="gemm", a=a, b=b, out_ptr=int(out),
epilogue=[
{"op": "dequant", "scale": s_per_k, "scope": "k_tile"},
{"op": "bias", "bias": bias_vec},
{"op": "relu"},
{"op": "scale", "factor": 0.5},
],
)
tl.wait(h)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User asked to surface system-wide congestion (more accurate than
single-cube), bring back the latency-breakdown plot under a separate
filename, and rename the obscure ``streaming`` category.
Scenarios:
Renamed all_pe_to_pe0 → all_pe_cube0_to_pe0 (clarify cube scope).
Added two SIP-wide scenarios:
sip_local_all — every PE in sip0 (128 total) accesses its own
local slice. All paths disjoint (each PE owns
its own hbm_ctrl.peX), so the model should
scale linearly with cube count.
sip_hotspot_pe0 — every PE in sip0 (128 total) targets
sip0.cube0.pe0_slice. Worst-case hotspot:
UCIe inbound + r0c0→hbm_ctrl.pe0 saturated.
Each bar now carries an ``N=...`` annotation showing the issuer
count, and the chart titles say the scope explicitly.
Effective BW + util at 16 KB:
sip_local_all N=128 eff= 27.2 TB/s util_a= 83 %
sip_hotspot_pe0 N=128 eff= 134 GB/s util_a= 93 %
(UCIe-into-cube0 saturated)
Plots:
no_congestion.png + congestion.png — Effective BW utilization
(two bars: single vs aggregate peak)
breakdown_no_congestion.png +
breakdown_congestion.png — stacked latency breakdown
(renamed from previous)
summary.csv with columns for both views.
The visual y-cap on BW utilization is 150 %. Bars exceeding it (e.g.
sip_local_all's util_single = 10,639 %) are drawn at the cap with an
upward arrow and the real value annotated. The verification rule for
``util_single`` is loosened to ``≤ n_issuers × 100 % + 5 %`` so
massively-parallel disjoint scenarios pass.
Category renamed: ``streaming`` → ``wire_transfer``. It is the
bulk-transfer time = (n_flits − 1) × flit_bytes / bottleneck_bw — the
cost of streaming the rest of the payload through the slowest wire
after the first flit has arrived.
All checks PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each scenario now shows TWO bars:
util_single = effective_bw / single-path peak × 100
(peak = min bw_gbs on first issuer's path)
util_aggregate = effective_bw / aggregate-resource peak × 100
(peak = max-min fair share across concurrent paths)
Aggregate peak uses a max-min fair-share computation: each concurrent
path's sustainable share on an edge is bw_gbs / usage_count, the
per-path throughput is the min share along its edges, and the aggregate
peak is the sum across paths. This produces the correct answer for both
shared-bottleneck scenarios (N paths converge on one wire → aggregate =
wire BW) and multi-lane shared resources (UCIe's 4 connections used in
parallel → aggregate ≈ 4 × per-conn BW), without enumerating max-flow.
Single-issuer (no_congestion) → util_single == util_aggregate by
definition. Congestion exposes the divergence:
ctrl_hot_{1,2,3}, all_pe_to_pe0 → both metrics agree (one shared
bottleneck: r0c0→hbm_ctrl.pe0 @ 256 GB/s)
8×PE eastbound → util_single=106 % (single conn @ 128 GB/s) but
util_aggregate=85 % (UCIe-W.conn0 @ 7-way shared,
aggregate peak ≈ 160 GB/s under the current
cross-cube routing that funnels via cube1.r0c0).
Verification updated to assert:
(2) util_aggregate ≤ 100 % (effective BW can't exceed the aggregate
resource peak, by construction).
(3) single-issuer util_single == util_aggregate.
(7) ucie_eastbound: util_aggregate is meaningfully smaller than
util_single (the multi-lane peak correction is observable).
CSV grows with peak_aggregate_bw_gbs and util_aggregate_pct columns;
breakdown columns retained.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the latency-breakdown stacked bars with a single utilization
bar per scenario. Each bar shows ``effective_bw / peak_bottleneck_bw``
with both values annotated, and a horizontal "single-path peak" line at
100 %. The colour band (green ≥70 %, amber ≥40 %, red <40 %) makes the
no-congestion distance roll-off scannable at a glance.
Definitions:
effective_bw = (total bytes transferred) / wall-clock time
no_congestion: nbytes / total_ns
congestion: n_issuers × nbytes / makespan_ns (aggregate)
peak_bw = min(edge.bw_gbs) on first issuer's path
util_pct = effective_bw / peak_bw × 100
The congestion graph shows that 8×PE eastbound exceeds 100 % of a
single-path peak (106.4 %): UCIe-N's 4 connections × 128 GB/s give
512 GB/s of aggregate eastbound capacity, so concurrent issuers across
disjoint conns sum past any single conn's 128 GB/s. The 8×PE→pe0_slice
hotspot reaches 91.7 %, almost saturating the shared r0c0→hbm_ctrl.pe0
bottleneck — the simulator's address-based PC striping + per-flit
arbitration model amortises the cost cleanly.
Self-verification updated to BW invariants:
(1) effective BW shrinks as topological distance grows
(2) util_pct ∈ (0, 250 %]
(3) single-issuer util_pct ≤ 100 %
(4) effective_bw = nbytes / total_ns for single requests
(5) congestion aggregate BW grows monotonically with issuer count
on the hot-target series
(6) 8-PE all-hit-pe0 saturates ≥ 70 % of shared peak
All checks PASS at the current model.
The CSV retains all breakdown components (pe_setup, noc_mesh, ucie,
fabric, streaming, hbm_ctrl, contention) so a future replot can still
recover the latency-breakdown view without re-running the simulator.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/plot_pe_dma_perf.py runs the simulator across six
no-congestion scenarios (SAME_CUBE_PE_LOCAL / REMOTE_BEST /
REMOTE_WORST, REMOTE_CUBE_BEST / REMOTE_WORST, REMOTE_SIP) and
five congestion scenarios (1/2/3 PE hot-target, 8-PE corresp.
cube-to-cube, 8-PE all-hit-pe0). It categorises actual total /
makespan into pe_setup, noc_mesh, ucie, fabric, streaming,
hbm_ctrl, and a contention residual using a wormhole-pipelined
model (first-flit arrival + (n_flits-1)/bottleneck + final
chunk_time).
Outputs:
docs/diagrams/pe_dma_perf/no_congestion.png — single-PE latency
by topological distance. Visualises monotonic growth from
SAME_CUBE_PE_LOCAL (77 ns) up to REMOTE_CUBE_PE_REMOTE_WORST
(573 ns) and REMOTE_SIP (409 ns).
docs/diagrams/pe_dma_perf/congestion.png — makespan as concurrent
issuer count grows. ctrl_hot_{1,2,3}=82/158/230 ns; 8-PE
eastbound UCIe = 963 ns; 8-PE all-hit-pe0 = 558 ns.
docs/diagrams/pe_dma_perf/summary.csv — raw rows for re-plotting.
Built-in --verify harness asserts:
(1) distance monotonicity for no-congestion;
(2) same-cube paths contain zero UCIe budget;
(3) remote-cube/SIP paths carry positive UCIe budget;
(4) breakdown is internally consistent (formula ≤ actual);
(5) streaming term matches (n_flits-1) × flit_bytes /
bottleneck_bw within 5 % for the local scenario;
(6) congestion makespan is monotonic in issuer count;
(7) 8-PE hotspot strictly exceeds 3-PE hotspot.
Cross-SIP gets a looser 70 % contention slack because the path
crosses two non-flit-aware (pcie_ep) boundaries that force
store-and-forward re-streaming the simple formula does not
attribute. Single-cube scenarios stay under 25 % residual.
All checks PASS at the current model (post ADR-0019 D1/D4
per-PE HBM CTRL restoration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restores per-PE HBM controller partitioning that was lost in
commit 5917b34 ("Replace xbar/bridge/single-NOC with explicit
router mesh"), which had over-consolidated the per-slice HBM CTRL
into a single cube-wide ``hbm_ctrl`` connected to every router —
the opposite of what ADR-0019 D1/D4 specifies.
Builder splits ``hbm_ctrl`` into 8 ``hbm_ctrl.pe{X}`` instances per
cube, each reachable ONLY through PE_X's attaching router via the
existing ``peX.hbm`` attach metadata from cube_mesh.yaml. Cube
aggregate BW now matches the spec (8 PEs × 8 PCs × 32 GB/s =
2048 GB/s) instead of collapsing to 256 GB/s.
AddressResolver decodes the target PE from the HBM PA's hbm_offset
(``offset // slice_size``) and returns ``hbm_ctrl.pe{X}``. PathRouter
uses the existing ``_adj_local`` adjacency for same-cube PE_DMA so
the cube's own UCIe port can no longer appear as a zero-distance
shortcut between routers — local PE_DMA now traverses the mesh,
restoring the ADR-0019 D4 worked example
``PE0.pe_dma → r0c0 → … → r1c4 → hbm_ctrl``.
Tests:
- New tests/test_per_pe_hbm_partition.py: 14 tests covering
topology shape, per-PE router exclusivity, PA resolution,
single-hop local path, cross-PE mesh traversal, and end-to-end
latency monotonicity. Probe CLI now reports
pe-local < pe-same-half < pe-cross-half (was uniform 141ns).
- Existing tests updated for new node ids and replaced two
assertions that locked in the wrong consolidation:
test_noc_mesh.test_hbm_connects_to_all_routers and
test_topology_compile.test_hbm_ctrl_connects_all_routers are
now per-PE exclusivity assertions; test_routing
.test_all_pe_hbm_equidistant becomes
test_cross_pe_hbm_distance_increases_with_mesh_hops.
- test_ipcq_buffer_kind_locations.test_hbm_pe_hop_charged_at_large_payload
threshold recalibrated 4000→1500 ns: the prior figure reflected
serialization on the over-consolidated single hbm_ctrl; per-PE
partitioning removes that artificial contention so the gap
shrinks to the genuine PE↔HBM-hop cost.
Full suite: 645 passed, 1 skipped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces global round-robin with deterministic address-derived PC
striping:
pc_shift = log2(burst_bytes)
pc_mask = num_pcs - 1
pc = (flit.address >> pc_shift) & pc_mask
Each Transaction carries base_address (HBM byte offset of the first
chunk); each Flit derives its own address as base + i*flit_bytes.
HBM CTRL routes flits to PCs via this formula, replacing the
arrival-order RR pointer. Also splits the is_last wait into an
asynchronous _finalize_txn process so the worker isn't blocked on
PC commit, exposing true PC parallelism for disjoint addresses.
phyaddr.py documents the canonical bit layout (bits [10:8] for the
default burst=256, num_pcs=8 case). ADR-0033 D6 records the
derivation and the workload scenarios where address-striping
matters (strided streams, offset-disjoint parallel transfers).
Adds tests/test_hbm_address_based_pc.py: canonical bit mapping,
strided 8-way load distribution, same-address PC-0 serialization,
PC-aligned 2KB pair collision, dynamic pc_shift from burst_bytes,
and power-of-2 attr validation. Integration tests inspect
_pc_avail ledger directly: at default config UCIe's 8 ns per-txn
overhead exactly matches chunk_time, masking PC contention at the
makespan level even though the ledger correctly distinguishes the
cases.
Full suite: 631 passed, 1 skipped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Allreduce + pe2pe + ipcq + pe_view auto-regenerated by test sweeps
running against the new chunk-streaming wire timing (per-flit
wormhole) — absolute numbers shift upward to reflect bottleneck-link
transit charged once per flit (instead of the previous cut-through
subtraction at HBM CTRL).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Standalone summary of the modeled hardware hierarchy and components.
Cross-references ADR-0003, 0004, 0014, 0017, 0022.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NOC topology is an implementation choice (mesh, ring, crossbar, etc.).
ADR-0017 covers the current 2D mesh choice; ADRs at the system-level
shouldn't bind to that specific implementation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cycle-accurate arbitration policies (priority/iSLIP) downgraded to
"academic / specific use cases" — FIFO inbox is approximately fair
for typical similar-rate workloads (GEMM, AllReduce, data parallel).
True impact appears only for QoS modeling or per-stream tail latency
analysis under saturation.
Higher-priority items pulled forward: address-based PC selection at
HBM CTRL (directly affects multi-PE concurrent HBM contention), bank
conflict modeling, HBM scheduler, finite buffer backpressure, op_log
chunk-streaming integration.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Earlier the future-work list mentioned "multi-flow fair sharing on a
single shared link" which was confusing — each wire has a single
source, so this isn't a real gap. The actual modeling story:
- Multi-stream merging at routers IS handled via per-in_port fan_in +
shared inbox + FIFO worker forwarding. Flits from different
upstream streams interleave at flit granularity naturally.
- What's NOT modeled: cycle-accurate arbitration policies (priority,
iSLIP), address-based PC selection at HBM CTRL (round-robin is
address-blind, so size-aligned concurrent transactions hit full
PC contention even when real-HW address striping would diverge),
sub-flit (32B) granularity, finite buffer backpressure, and bank
conflict modeling.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- test_op_log_per_transaction_not_per_flit (renamed from
..._records...): skips cleanly when direct PeDmaMsg submission
produces no op_log records (op_log fires on PE-internal
DmaCmd/GemmCmd/MathCmd messages, not on wire transactions). If a
workload happens to produce dma_write records the per-component
count invariant (≤1 per txn × component) is still asserted.
- ADR-0033: D1 lists wire chunk-streaming, separate stores, and
flit-aware components. D2/D3/D4 updated for new wire model.
D6 future work notes op_log full integration with chunk-streaming.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- test_h2d_local_cube_cut_through: threshold 65 → 80ns. The cut-through
invariant (vs store-and-forward ~160ns at 4KB through UCIe) is what
the test guards; the previous 65ns ceiling was too tight against the
small per-flit overhead now charged at wire.
- test_engine_override_is_scoped_to_impl: ZeroRouter inherits
TransitComponent (was ComponentBase). Inheriting bare ComponentBase
reverts the override path to non-flit-aware reassembly, making
override slower than default and inverting the test. The test's
intent is overhead=0 vs overhead=2, not flit-awareness.
- test_intra_sip_critical_path_at_96k_below_threshold: threshold
20.5 → 30 µs. Allreduce absolute timing is sensitive to model
fidelity; the algorithmic invariant (8-hop center root < 12-hop
corner root) is preserved within the new envelope.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of Phase 2c-1 timing collapse identified: src.out_port and
dst.in_port aliased the same simpy.Store, so when wire chunkified a
Transaction into Flits and re-put them, fan_in could pull flits before
the wire applied bw delay — half the flits bypassed bottleneck timing.
Fix: separate Stores per directed edge. Wire is the only conduit. Each
flit on the wire incurs chunk_time = flit_nbytes/bw_gbs once, in arrival
order. Multi-hop wormhole pipelining emerges naturally because
flit-aware pass-through (TransitComponent) forwards each flit serially
without reassembly.
64 KB MemoryWrite via UCIe 128 GB/s bottleneck: 273 ns (broken) → 545 ns
(matches drain 512 + commit 8 + path overheads). 1 MB: 8230 ns (matches
drain 8192). Single-flit transfer transport-time alone, exactly what
real-HW wormhole produces.
3 pre-existing tests now off by small margins or inverted:
- test_h2d_local_cube_cut_through: 65.53 vs threshold 65.0
- test_engine_override_is_scoped_to_impl: ZeroRouter inherits
ComponentBase, not flit-aware, so override path reassembles at each
hop while default doesn't
- test_intra_sip_critical_path_at_96k_below_threshold: 96KB allreduce
microscopically over its threshold
Not weakening these to pass: they reflect model fidelity improvements
that need calibrated thresholds. To address in follow-up via test
threshold updates and ZeroRouter→TransitComponent inheritance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire decomposes Transactions into Flits per `_flit_bytes` but emits all
flits atomically at the same env.now — preserves single-msg timing as
infrastructure for Phase 2c-2 (per-flit timing + flit-aware routers).
Non-flit-aware components reassemble Flits in `_fan_in`; `_update_step`
sets txn.step to current component's path position so legacy
step-based routing continues working when upstream is flit-aware.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous model double-counted slow-upstream paths (e.g., 64KB via UCIe
128 GB/s was ~2x pessimistic). HBM CTRL now distributes bursts across
8 pseudo-channels via global round-robin, with per-chunk commit timing
that pipelines correctly against the bottleneck link's data arrival.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related issues caused measured pipeline efficiency to look
worse than the simulator's actual behavior:
1. DMA timing recorded too early. The op-log start timestamp
for a DMA op fired when the request entered the queue, and
the DMA channel was released as soon as the request was
issued. Back-to-back DMAs therefore appeared to grab the
channel simultaneously, with per-op duration drifting
upward as queue depth grew - an artifact, not real cost.
Fix: defer the start timestamp until after the channel is
acquired, and hold the channel through the full HBM
round-trip until the response returns. Per-op duration is
now constant and equal to the actual transfer interval;
serialization is visible as queue wait, not as inflated
service time.
2. Sweep timing window folded in pre-composite work. The PE
timing window spanned every PE engine record, which
included the upfront pinned-operand DMA issued before the
composite GEMM begins. For large-K shapes that one-shot
load can be nearly half of the window, conflating
operand-staging cost with composite-pipeline behavior.
Fix: add a second window scoped to the composite pipeline
by filtering op_log records to those tagged with a
tile-pipeline stage; the legacy operand-load path is
untagged and naturally excluded. For 32x3072x32 load_ref
the window drops from 1765ns to 992ns and measured eff
lines up with the steady-state DMA-bound stage limit
instead of being penalized for the one-time load.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/build_overview_slides.py renders a 5-slide PPTX
(kernbench2_overview.pptx) summarizing architecture, model
correctness, IPCQ, allreduce, and buffer-kind tier comparison.
scripts/emit_overview_with_external_ref.py renders log-y and
broken-y variants of the allreduce overview (overview_log.png,
overview_broken.png) including a 366 µs ext-sim reference marker
at 96 KB / PE.
Also includes cube_mesh_view.png rendered from the SVG.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pe2pe overview compared IPCQ (tl.send + tl.recv) against raw DMA
(tl.load + tl.store), but DMA is one-sided — DST never reads — while
tl.recv pays a slot-read on DST. The comparison was unfair: IPCQ
looked slower partly because it does more work.
Adds tl.recv_no_consume() — a separate, diagnostic-only entry point
that blocks for slot arrival but skips the slot-read (and bank-hop)
charge on DST. Production tl.recv is unchanged (no `consume` kwarg
on the public API), so the diagnostic flag can never accidentally
leak into real workloads.
Updates test_pe_to_pe_latency to call tl.recv_no_consume so the
overview.png shows IPCQ no-consume vs raw DMA on equal footing.
Also fixes PLOT_DIR back to docs/diagrams/pe2pe_latency_plots/
(was lost in a merge). Adds scripts/replot_pe2pe.py for label-only
re-renders without re-measuring.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cube SRAM and HBM live on the cube NoC behind router-attached links
(sram_to_router_bw_gbs=128, hbm_to_router_bw_gbs=256). Previously the
slot-IO model treated them as if they were per-PE local, so the
buffer_kind sweep showed TCM ≈ SRAM at 64 KB / PE.
pe_ipcq._handle_recv and pe_dma._handle_ipcq_inbound now charge a
PE→bank compute_drain_ns on top of the intrinsic slot-IO for SRAM/HBM.
TCM stays free of this hop. Adds an internal IpcqRecvCmd.consume field
that gates the recv-side hop+slot-IO charges (used by a follow-up
diagnostic API; default True keeps current behavior).
Post-fix at 64 KB / PE: TCM 12.0 µs < HBM 21.4 µs < SRAM 24.3 µs.
SRAM is slowest because its 128 GB/s bank link is the narrowest in
the system — narrower than HBM's 256 GB/s. The existing ordering test
is rewritten from tcm<sram<hbm to tcm<hbm<sram and a new
test_ipcq_buffer_kind_locations adds 3 invariants on the gap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This is the diagram generator that emits ipcq_send_recv.png and
ipcq_two_pe_dma.png (referenced by commit 1e39214 but accidentally
left untracked).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Conflict resolution:
- intercube_allreduce.py: kept origin's `if single_cube:` early-exit
(TP launches kernel on one cube/rank → skip intra-SIP mesh and go
direct to inter-SIP exchange) AND replaced the multi-cube body with
the local center-root + bidirectional reduce/broadcast (8-hop
critical path on 4×4 vs 12 with corner root).
- tests/{allreduce,pe2pe}_latency_plots/: kept the local move to
docs/diagrams/; dropped origin's stale content edits to the old
paths (regenerable derived artifacts).
- docs/diagrams/pe2pe_latency_plots/summary.csv: kept local
(post-Phase-2 + center-root values).
Origin contributions retained as-is:
- pyproject.toml: matplotlib >= 3.7 dep.
- runtime_api/distributed.py: derive effective cube_w/h from tensor
shard placement so single-cube TP paths get cube_w=cube_h=1.
- kernel_args() now accepts optional cube_w/cube_h kwargs.
Verified post-merge:
- test_intercube_root_center.py: 2/2 (center-root multi-cube path).
- test_tp_layers.py + test_tp_mlp.py: 10/10 (single-cube TP path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the algorithmic root cube from the corner (cube_w-1,
cube_h-1) to the geometric center (cube_w//2, cube_h//2) and
have each phase converge bidirectionally so the intra-SIP
critical path drops from ~12 hops to ~8 hops on a 4×4 mesh
(left half W→E + right half E→W in row reduce; top half N→S +
bottom half S→N in col reduce; mirrored on broadcast).
Result on torus_2d 6 SIPs at 96 KB / PE on TCM:
before (corner root) : 22.0 µs
after (center root) : 17.2 µs (−22%)
Same shape on ring_1d (−7%) and mesh_2d_no_wrap (−12%); also
holds across SRAM and HBM (~−20% each).
Phase 1 test (test_intercube_root_center.py) asserts the
torus_2d 96 KB latency drops below 20.5 µs and that all 96
cubes still validate (correctness preserved).
Plot updates:
- overview.png: replace constant 10.6 µs theoretical line with
user-supplied hand-derived curve (per-cube packet count =
bytes_per_pe × 8 PEs ÷ 128 B; 1346 ns startup + 1.20 ns/pkt).
- All summary.csv numbers and per-topology PNGs regenerated.
- pe2pe_latency_plots and ipcq diagram emitter PNGs refreshed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>