kernbench2

Author	SHA1	Message	Date
ywkang	1f36baa898	ADR: add 0038-0042 (pcie_ep, pe_mmu, pe_tcm, sram, tiling) Fill component-model coverage gaps surfaced by /report's G4 analysis. Each ADR documents the component's First action, latency model, and honest notes on dormant code or implementation asymmetries discovered during re-evaluation against current code. - 0038 pcie_ep: thin protocol-overhead model; ComponentBase forwarding worker as-is; named-node contract for router helpers - 0039 pe_mmu: component + utility dual role; sub-page region stopgap; D2.1 flags pipeline path missing mmu.overhead_ns timeout (asymmetric with non-pipeline; not visible at default tlb_overhead_ns=0) - 0040 pe_tcm: dual-channel BW serialization (read/write Resource cap=1); TcmRequest schema owned by TCM; timing-only (no data store) - 0041 sram: terminal scratchpad model + ResponseMsg on reverse path; D1.1 flags _worker override as currently dormant (no Transaction actually targets the SRAM node today) - 0042 tiling: pure plan-generator module, not a component; corrects the G4 misclassification; pins GEMM/Math stage sequences and epilogue scope contract Also: /report skill G3 refinement — only flag older->newer asymmetric cross-references; newer->older (e.g., 0034-0037 citing infrastructure ADRs) are expected one-way and no longer reported. Bilingual pair verifier (tools/verify_adr_lang_pairs.py) passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 14:43:03 -07:00
ywkang	049e3d8bb3	benches: package as kernbench.benches, add @bench registry + list subcommand Move benches/ -> src/kernbench/benches/ and src/kernbench/cli/probe.py -> src/kernbench/probes/probe.py. Each bench self-registers via @bench(name=..., description=...); kernbench list enumerates benches with auto-assigned indices, --bench accepts kebab-case name or numeric index. Audit at package-import time fails if any non-underscore module forgets the decorator. ADR-0010 (EN + KO) updated to reflect the new resolver path, list subcommand, and probes package separation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 14:42:10 -07:00
ywkang	168b0c89f0	ADR: translate adr-ko/ to Korean, fix ADR-0013 slug, refine Status check Follow-up to the bilingual-structure commit: docs/adr-ko/ now holds only Korean versions (24 files translated from English placeholders), ADR-0013 slug uses kebab-case in both folders, and the verify tool allows translated parenthetical commentary in the Status block. - Translate 24 English files in docs/adr-ko/ to Korean. The previous bilingual-structure commit had left these as English copies because their source content was already English; this commit fulfills the policy that docs/adr-ko/ contains only Korean. - Rename ADR-0013 in both adr/ and adr-ko/ from ver-verification_strategy.md to ver-verification-strategy.md (kebab-case consistency with other ADRs). - CLAUDE.md (ADR Translation Discipline): clarify that only the Status lifecycle keyword (Accepted / Proposed / Stub / Draft / Superseded by ADR-NNNN / Merged into ADR-NNNN) must match across EN and KO; parenthetical commentary and trailing list items may be translated. - tools/verify_adr_lang_pairs.py: replace byte-equal Status check with normalize_status_keyword() which strips parenthetical commentary and takes only the first non-empty line. - tests/test_verify_adr_lang_pairs.py: update existing test names, add coverage for translated parenthetical, translated trailing list, and Superseded-by-NNNN keyword equality. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 08:17:56 -07:00
ywkang	a796c1d2f7	ADR: bilingual structure — EN canonical in adr/, KO mirror in adr-ko/ Establish English as the canonical ADR language with Korean translations held in a parallel docs/adr-ko/ tree as derived artifacts (1:1 mirror). Promotion from adr-proposed/ to adr/ now writes English to adr/ and the Korean to adr-ko/; bidirectional sync rule documented in CLAUDE.md. - Migrate 30 ADRs in docs/adr/: 28 Korean-only translated to English, 2 bilingual pairs (ADR-0020, ADR-0023) consolidated (.en.md suffix dropped). ADR-0023 EN regenerated against KO source which had newer HW Realization Notes (D16-D23) section. - docs/adr-history/ left frozen by design (transitional state). - CLAUDE.md (Part 2): update ADR Lifecycle for 4-folder layout, mark docs/adr-ko/ as a Derived Artifact, add ADR Translation Discipline section covering bidirectional sync, conflict resolution (EN wins), and proposed-language freedom. - tools/verify_adr_lang_pairs.py: new verification tool checking pair completeness, filename mirroring, ADR-ID match, Status byte-equality. Pre-commit hook intentionally not added; run on demand or in CI. - tests/test_verify_adr_lang_pairs.py: 11 cases including CRLF/LF normalization, em-dash title separator, underscore-slug edge case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:38:44 -07:00
ywkang	687c98086d	ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037 Filename + lifecycle: - ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable. - ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2: docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft), docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for retroactive docs pending verification. Merges (one ADR per topic, no change-history annotations): - ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items) - ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl. TileToken self-routing and multi-op composite epilogue scope) - ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md deleted; ADR-0019/0021 moved to adr-history with one-line stub status Retroactive documentation (G4 closures, code-verified): - ADR-0037 forwarding component (TransitComponent: first-flit overhead, serial worker, path-based routing, single impl/multiple names) - ADR-0036 IO_CPU component (target_start_ns global barrier stamping, per-cube fan-out, response aggregation) - ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources, target_start_ns passthrough) - ADR-0034 HBM controller internal design (per-PC state, address-based selection, flit-aware per-flit commit, async finalize, command-only fallback path) Content updates: - ADR-0010 expanded to full CLI surface (run/probe/web), retitled "Command Line Interface and Execution Semantics" - ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned - ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata block replaced with standard Status header - ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4); ADR-0027 cleaned of supersession history - ADR-0033 D6 cleanup: address-based PC selection moved out of future-work (now documented in ADR-0034 D3); related D1/D3 wording realigned - Cross-references back-filled in 5 ADRs (G3 gaps closed) Onboarding docs split: - docs/onboarding/ created - moved: hw-architecture-overview.md, latency-model.md, di-presentation.md, ccl-author-guide{,.en}.md - references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8). No behavior change. Tooling: - tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py (ADR EN/KO pair invariant checker) - .claude/commands/report.md tracked (/report slash command) - .gitignore: allow .claude/commands/*.md while keeping settings files ignored Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:15:55 -07:00
ywkang	22fd0d2b9d	ADR: introduce docs/history/, merge 0011+0018, prune migration cruft - CLAUDE.md: add ADR Lifecycle subsection (superseded → docs/history/, immutable numbering, no renumber) - ADR-0011: merge ADR-0018 content as "Address Model: LA" section alongside PA / VA; status notes VA model is currently implemented - ADR-0018 / 0029 / 0031: moved to docs/history/ with status updates (0018 merged into 0011, 0029 superseded by 0032, 0031 absorbed into 0001 rev 2) - ADR-0019: rewrite Context as PE-HBM connectivity decision (self-contained, no LA model framing) - ADR-0019/0020/0021/0023/0025/0027: Status Proposed → Accepted (code verified) and prune Implementation Notes / Affected files / Test strategy / "현재 상태" ("Current State") sub-sections describing pre-impl state - ADR-0024/0026: same migration-flavor cleanup; 0026 also drops D6 Migration and D8 docs-update sub-decisions - ADR-0030: status simplified (blocker ADR-0031 now superseded) - SPEC.md: R10 + §0.2 reflect PA / VA / LA model names - ADR-0008/0012/0013: refresh ADR-0011 subtitle in Links 21 files changed, 553 insertions(+), 1290 deletions(-). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 11:42:45 -07:00
ywkang	aaa1cbfaf6	ADR-0033 D6: address-based PC selection at HBM CTRL Replaces global round-robin with deterministic address-derived PC striping: pc_shift = log2(burst_bytes) pc_mask = num_pcs - 1 pc = (flit.address >> pc_shift) & pc_mask Each Transaction carries base_address (HBM byte offset of the first chunk); each Flit derives its own address as base + i*flit_bytes. HBM CTRL routes flits to PCs via this formula, replacing the arrival-order RR pointer. Also splits the is_last wait into an asynchronous _finalize_txn process so the worker isn't blocked on PC commit, exposing true PC parallelism for disjoint addresses. phyaddr.py documents the canonical bit layout (bits [10:8] for the default burst=256, num_pcs=8 case). ADR-0033 D6 records the derivation and the workload scenarios where address-striping matters (strided streams, offset-disjoint parallel transfers). Adds tests/test_hbm_address_based_pc.py: canonical bit mapping, strided 8-way load distribution, same-address PC-0 serialization, PC-aligned 2KB pair collision, dynamic pc_shift from burst_bytes, and power-of-2 attr validation. Integration tests inspect _pc_avail ledger directly: at default config UCIe's 8 ns per-txn overhead exactly matches chunk_time, masking PC contention at the makespan level even though the ledger correctly distinguishes the cases. Full suite: 631 passed, 1 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 00:18:46 -07:00
ywkang	32b29a1e5c	ADR-0003/0014: generalize "router mesh" to "NOC" NOC topology is an implementation choice (mesh, ring, crossbar, etc.). ADR-0017 covers the current 2D mesh choice; ADRs at the system-level shouldn't bind to that specific implementation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 23:23:46 -07:00
ywkang	c9bd5387ac	ADR-0033 D6: reorder future work by workload impact Cycle-accurate arbitration policies (priority/iSLIP) downgraded to "academic / specific use cases" — FIFO inbox is approximately fair for typical similar-rate workloads (GEMM, AllReduce, data parallel). True impact appears only for QoS modeling or per-stream tail latency analysis under saturation. Higher-priority items pulled forward: address-based PC selection at HBM CTRL (directly affects multi-PE concurrent HBM contention), bank conflict modeling, HBM scheduler, finite buffer backpressure, op_log chunk-streaming integration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 23:21:35 -07:00
ywkang	9beb140eaa	ADR-0033 D6: clarify what multi-flow merging actually models Earlier the future-work list mentioned "multi-flow fair sharing on a single shared link" which was confusing — each wire has a single source, so this isn't a real gap. The actual modeling story: - Multi-stream merging at routers IS handled via per-in_port fan_in + shared inbox + FIFO worker forwarding. Flits from different upstream streams interleave at flit granularity naturally. - What's NOT modeled: cycle-accurate arbitration policies (priority, iSLIP), address-based PC selection at HBM CTRL (round-robin is address-blind, so size-aligned concurrent transactions hit full PC contention even when real-HW address striping would diverge), sub-flit (32B) granularity, finite buffer backpressure, and bank conflict modeling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 23:18:19 -07:00
ywkang	c6788788a4	ADR-0033 Phase 2c-3 finish: op_log test + ADR doc reflect chunk-streaming - test_op_log_per_transaction_not_per_flit (renamed from ..._records...): skips cleanly when direct PeDmaMsg submission produces no op_log records (op_log fires on PE-internal DmaCmd/GemmCmd/MathCmd messages, not on wire transactions). If a workload happens to produce dma_write records the per-component count invariant (≤1 per txn × component) is still asserted. - ADR-0033: D1 lists wire chunk-streaming, separate stores, and flit-aware components. D2/D3/D4 updated for new wire model. D6 future work notes op_log full integration with chunk-streaming. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 23:12:50 -07:00
ywkang	5fdb6f8797	Latency model: HBM PC striping + chunk-loop drain (ADR-0033) Previous model double-counted slow-upstream paths (e.g., 64KB via UCIe 128 GB/s was ~2x pessimistic). HBM CTRL now distributes bursts across 8 pseudo-channels via global round-robin, with per-chunk commit timing that pipelines correctly against the bottleneck link's data arrival. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 21:59:07 -07:00
mukesh	1c33afec55	ADR-0032 + intra_* opposite directions in IPCQ install Add intra_N/S/E/W to install.py _OPPOSITE_DIR table so the intra-cube PE-to-PE namespace is symmetrical with intercube N/S/E/W. ADR-0032 documents the intercube allreduce algorithm (supersedes ADR-0029). Refresh ADR-0024/0025/0029 cross-refs and update test_intercube_sfr_config.py to cover the new intra_* mappings. Drop the obsolete test_ccl_round_robin_recv.py (replaced by intercube tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 16:43:01 -07:00
ywkang	81cc32c46b	ADR-0001 Rev 2: 51-bit PhysAddr layout with concrete sub-unit tables Remove rack_id (4 bits), rename sip_seg→die_id, shift fields to enable 42-bit local_offset (4 TB per die). Define PE_LOCAL/MCPU_LOCAL/CUBE_SRAM sub-unit tables for AHBM dies and IOCPU sub-unit table for IOCHIPLET dies (1 TB window). Supersedes ADR-0031. Also fixes latent VA/PA confusion in pe_dma pipeline DMA path where virtual addresses were decoded as physical addresses without MMU translation — previously masked by coincidental bit-position alignment. 529 passed (+6 recovered), 10 pre-existing failures unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-27 15:52:29 -07:00
mukesh	c1a5cf3a2a	ADR-0009 D5: chain-aware target_start_ns + zero-byte launch fanout The single-walk predictor (find_node_path(io_cpu, pe_cpu) + compute_path_latency_ns) under-shot actual dispatch latency for far cubes -- the routing graph could pick a path bypassing M_CPU, and non-zero-nbytes launch sub-txns serialized on shared first hops. Far PEs arrived at _execute_kernel after target_start_ns, silently skipped the barrier yield, and started pe_exec_start late. Their reported pe_exec_ns under-counted by exactly the late_ns amount (63 ns observed at h4 cube4.pe0 in the IPCQ test, up to 113 ns worst case for cubes 9-11), producing the suspicious flat region in the h4 IPCQ curve at 8192/10240 bytes. Fix: - IO_CPU predictor uses the explicit two-leg chain (IO_CPU->M_CPU + M_CPU->PE_CPU - io.overhead - m.overhead), so every PE on every targeted cube has a barrier >= its real dispatch arrival. - Kernel-launch fanout sub-txns carry nbytes=0 (control-plane, not data-plane), removing the per-cube fanout serialization that pushed far M_CPUs past the predictor. - Legacy io_cpu mirror updated. ADR-0009 D5 mechanism updated to specify the two-leg formula and the nbytes=0 requirement. New tests/test_d5_barrier_invariant.py asserts (a) no PE enters _execute_kernel after target_start_ns and (b) every PE in a multi-cube launch has identical pe_exec_start -- both regressions silently pass on the existing tests/test_kernel_launch_sync.py because that test only inspects post-aggregation max(pe_exec_ns). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 15:12:58 -07:00
mukesh	90874abbfe	ADR-0023 D9: blocking credit-emit with full-path latency PE_IPCQ._handle_recv now yields-from _delayed_credit_send instead of spawning it as a fork, so the receiver's pe_exec_ns includes the credit-return cost. _credit_latency_ns switches from compute_drain_ns(path, 16) to compute_path_latency_ns(path, 16) and fixes a latent find_path bug where the destination lacked the ".pe_dma" suffix (silently returned 0 ns under the bare except). Net effect on h3/h4 inter-cube pe-to-pe latency: IPCQ >= raw DMA at every size, matching real-HW posted-write semantics. tl.send remains fire-and-forget. ADR-0023 D9 amended; new diagnostic test tests/test_pe_to_pe_diagnostic.py captures per-PE pe_exec_ns, paths, drain, and meta-arrival timing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 15:12:38 -07:00
mukesh	14d800b0ae	Kernel-launch sync (ADR-0009 D5) and IPCQ drain at inbound (ADR-0023) - KernelLaunchMsg gains target_start_ns: IO_CPU stamps a global barrier (max path latency across every target PE), M_CPU passes it through, PE_CPU yields until it before recording pe_exec_start. Every PE in a launch begins kernel execution at the same env.now regardless of its dispatch path length — eliminates per-PE dispatch-offset artifact in cross-PE and cross-cube latency measurements. - PE_DMA._handle_ipcq_inbound now pays Transaction.drain_ns at the top, matching the terminal-drain behavior of ComponentBase._forward_txn for every non-IPCQ Transaction. SRC-side tl.send stays fire-and-forget (sender doesn't yield on sub_done); tl.recv now blocks until bytes have actually drained into its inbox. - ComponentContext: new compute_path_latency_ns helper + node_overhead_ns field populated by GraphEngine. - tests/test_kernel_launch_sync.py: asserts all PEs in one launch produce identical pe_exec_ns for a no-op kernel (zero spread). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 15:30:29 -07:00
ywkang	e7f376ebaa	ADR-0027 rev7 (Megatron TP + worker-wait generalization) + ADR-0026 typo fix ADR-0027 is a design-only change (no production code). Rev 7 closes design across 7 iterations of review. Key decisions: - D0 (worker-wait generalization): ctx.wait in worker context yields to main scheduler, which drains env.run. Solves ADR-0024 Phase B orphan bug (ring_default_ws strict xfail). Normative contracts on resume invariant, fast-path, main-context non-reentrance, barrier loop-until-empty, and scheduler non-progress as user contract. - D0.5 (host-read barrier): Tensor.numpy/data/__getitem__/__repr__/copy_ auto-drain pending before reading. Closed-set via explicit registry (T5.g). copy_ uses global-pending barrier with explicit over-serialization tradeoff. - D1 (torch.multiprocessing.spawn): real-PyTorch API-signature parity, cooperative greenlet scheduler internally. Explicit non-goal on process isolation / address space / failure isolation. Sibling cleanup via SystemExit + SpawnException(errors) wrapping root-cause ranks. - D4/D5 (TP layers): ColumnParallelLinear / RowParallelLinear use torch.launch(gemm_kernel) — no host-side torch.matmul. Yield-safety contract normatively required for all TP forward paths. - Supersedes ADR-0024 D7/D12/D13 as design (none landed). Source of truth declared normative. Test strategy: T1-T8 with numerical-correctness primary (not mean/ aggregate-only), orphan invariant direct assertion, host-read barrier closed-set via registry. Phase 2 acceptance = 524 passed + 0 xfail (ring_default_ws unblocked by D0). ADR-0026 typo fix: torch.cuda.set_device → torch.ahbm.set_device in DPPolicy docstring (ADR-0024 D10 convention). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 14:13:26 -07:00
ywkang	357cab525b	ADR-0026: DPPolicy intra-device only + ShardSpec structural coords DPPolicy no longer carries a cross-SIP axis. SIP-level placement is solely controlled by torch.ahbm.set_device(rank) (ADR-0024); DPPolicy itself describes only the cube × PE layout within one SIP. ShardSpec switches to structural (sip, cube, pe) coordinates; the flat pe_index field/property is fully removed — silent drift between global-flat and SIP-local interpretations was a foot-gun flagged by ADR-0024 D11. Breaking API (explicit TypeError / AttributeError): - DPPolicy(sip=...) / DPPolicy(num_sips=...) -> TypeError - ShardSpec.pe_index -> AttributeError - ShardSpec(pe_index=...) -> TypeError - resolve_dp_policy now takes target_sip= (required), no num_sips. Downstream migration: - PE allocator dict keyed by (sip, cube, pe) tuples, in both _ensure_allocators and _free_tensor. deploy_tensor uses tuple lookup. - _create_tensor passes target_sip=current_sip; post-hoc pe_index shifting removed entirely. - launch._compute_local_shape drops the dp.sip branch. - Internal resolvers (column_wise / row_wise / replicate / tiled_) return _LocalPeShard (cube-local identifier) instead of ShardSpec — resolve_dp_policy lifts them to full structural coords. Tests: - New tests/test_adr0026_dppolicy_intra_device.py (12 tests) pins the contract end-to-end. - test_sip_parallel.py rewritten: SIP composition now modeled as two resolve_dp_policy(target_sip=...) calls (ADR-0024 launcher style). - Call-site migration: test_tensor, test_va_integration, test_va_offset, test_runtime_api_tensor, test_tl_recv_async, test_ccl_ and benches gemm_single_pe, gpt3_qkv, va_offset_verify, ccl_allreduce (legacy branch) all use intra-device DPPolicy and structural ShardSpec. Result: 523 passed, 1 strict xfail (ring_default_ws — unchanged ADR-0024 Phase B blocker; architectural fix deferred to ADR-0027). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 13:02:19 -07:00
ywkang	e1084800ab	docs: add ADRs 0024–0031 for SIP-TP launcher stack ADR-0024 (SIP-level TP launcher): rank = SIP abstraction, engine-routed install, mp.spawn parity, epoch barrier, ShardSpec structural coords. ADR-0025 (IPCQ direction addressing): address-based matching for meta arrival and credit return; fixes 2-rank bidirectional ring deadlock. ADR-0026 (DPPolicy intra-device only): remove sip/num_sips fields; ShardSpec uses structural (sip, cube, pe); pe_index property removed. ADR-0027 (Megatron-style TP API): ColumnParallelLinear / RowParallelLinear on top of ADR-0024 launcher. Backlog until 0024/0025/0026 land. ADR-0028 (DTensor support): stub / future work. ADR-0029 (Hierarchical all-reduce): 3-level reduce using all_pes mapper and multi_pe_sip_local validator from ADR-0024. Backlog. ADR-0030 (IPCQ PhysAddr integration): blocked on ADR-0031. ADR-0031 (PhysAddr PE-resource extension): stub; local_offset range-based partition approach; specific ranges TBD. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 00:38:27 -07:00
ywkang	b2c52f0e34	Add English translations for ADR-0018, 0019, 0020, 0021 - ADR-0018: LA-based memory address abstraction + BAAW + HBM channel mapping - ADR-0019: CUBE NOC per-channel and aggregated HBM connection model - ADR-0020: 2-pass data execution model (timing/data separation, greenlet) - ADR-0021: PE pipeline refactor (component separation + token self-routing) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 16:31:32 -07:00
ywkang	998cc85762	Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023) Major changes: PE-level IPCQ infrastructure: - New PE_IPCQ component: ring-buffer control plane with 4-direction neighbor mapping, head/tail pointers, backpressure (poll/sleep). - PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA, including in-flight data snapshot (D9) and op_log recording at outbound time for Phase 2 replay correctness. - IpcqDmaToken piggyback model: data + metadata travel together, atomic visibility at receiver (invariant I6). - Credit return fast path: bottleneck-BW latency, no fabric vc_comm. Phase 2 data execution (ADR-0020 integration): - op_log extended: DmaWriteCmd now captures src_space/src_addr for Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time. - DataExecutor replays dma_write + ipcq_copy in t_start order. - Engine._flush_data_phase: incremental cursor-based replay after each engine.wait() so host reads see post-Phase-2 data. - KernelRunner Phase 1 writes disabled when op_log is active to prevent stale data from corrupting the MemoryStore snapshot. TLContext / kernel API: - tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype), tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode. - TensorHandle operator overloading (add/sub/mul/div) via thread-local active TLContext → MathCmd dispatch through PE_MATH. - PE-local scratch allocator for math output handles. - tl.load returns space="hbm" handles for correct Phase 2 addressing. - Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv. Unified ccl_allreduce bench (PyTorch-compat host code): - Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch) split matching real PyTorch DDP worker pattern. - torch.distributed facade: init_process_group, get_world_size, get_rank, get_backend, all_reduce, barrier (only real PyTorch names). - AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches kernel via tensor shard metadata (n_elem from shards[0].nbytes). - world_size derived from topology spec (sips × cubes × pes_per_cube) with optional algorithm-level override in ccl.yaml. Tensor API (PyTorch-compat surface): - Tensor.numpy(): gather-aware (all shards via VA-based addressing). - Tensor.copy_(source): scatter from host tensor into sharded target. - RuntimeContext.from_numpy(arr): host-side staging tensor. - Tensor.data property fixed to use numpy() (was shards[0]-only). Algorithm modules moved to src/kernbench/ccl/algorithms/: - ring_allreduce, mesh_allreduce, tree_allreduce, hello_send. - Each module exports kernel_args(world_size, n_elem) helper. - ccl.yaml module paths updated to kernbench.ccl.algorithms.*. Dead code removed: - 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.). - _run_ccl_bench greenlet-per-SIP scheduler. - benches.loader.is_ccl_bench + run_rank detection. - benches/ccl/ directory. Tests: - New test_ccl_allreduce_matrix.py: 7 parametrized cases (ring×3 buffers, ring 8/16, mesh 4, tree 7). - New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests. - Existing tests updated for new import paths + world_size_override. Docs: - Korean ccl-author-guide.md and ADR-0023 paths updated. - New English versions: ccl-author-guide.en.md, ADR-0023.en.md. 502 tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 19:36:59 -07:00
ywkang	ff2c677a9c	Add 2D grid program_id semantics (ADR-0022) tl.program_id(axis=0) returns local PE id within cube, tl.program_id(axis=1) returns cube id. Enables cube-aware sharding in benchmark kernels. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 16:49:56 -07:00
ywkang	161132cdcb	ADR-0021: PE pipeline refactor — component separation + token self-routing Design for refactoring pe_accel monolith into independent builtin components: - D1: 6 independent components (scheduler, DMA, fetch_store, GEMM, MATH, TCM) - D2: Token self-routing — scheduler only dispatches + tracks completion - D3: done signal = simpy.Event (HW wire), data = message (queue) - D4: Async pipeline with single FIFO feeder, command-level ordering - D5: PE_FETCH_STORE separates TCM↔register from compute - D6: Compute components implement _process() only, chaining in base - D7: Topology adds pe_fetch_store + chaining edges - D8: Existing builtin/pe_accel → builtin_legacy backup, new builtin - D9: TileToken with plan + stage_idx for self-routing Key decisions from review: - No PipelineManager object — scheduler + existing ports sufficient - PipelineContext with exactly-once completion contract - _feed_loop singleton per scheduler, FIFO command ordering - Intra-PE chaining: no explicit latency model - Latency models ported from pe_accel current implementation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 23:21:40 -07:00
ywkang	140b85436a	ADR-0020: 2-Pass data execution model with greenlet kernel runner Design for actual data storage/computation in HBM/TCM/SRAM components: - Phase 1: SimPy timing + MemoryStore (memory ops data-aware via greenlet) - Phase 2: op_log-based numpy execution for GEMM/Math verification - Greenlet-based KernelRunner replaces Phase 0 command list generation - tl.load() returns real data in Phase 1, enabling memory-based control flow - ComponentBase hook for op logging (single source of truth) - MemoryStore: numpy ndarray tensor-granular storage with reference semantics Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 23:53:49 -07:00
ywkang	5917b3497c	Replace xbar/bridge/single-NOC with explicit router mesh (ADR-0019) - Remove xbar_top/bot, bridge, single noc node from topology - Each cube_mesh.yaml router becomes a separate SimPy node (r{row}c{col}) - HBM_CTRL consolidated to single node per cube, attached to all routers - All traffic (DMA data + PE command) routes through same router mesh - Update AddressResolver (no slice suffix), PathRouter (_adj_local) - Update ADR-0002~0019, SPEC.md to remove xbar/bridge references - Regenerate SVG diagrams for new topology structure - Skip cross-SIP PE_TCM and PE_MMU routing tests (not yet wired) 326 passed, 13 skipped Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 17:51:28 -07:00
ywkang	31c7110da7	Add ADR-0018 (LA/BAAW addressing) and ADR-0019 (NOC per-channel HBM) ADR-0018: LA replaces VA, BAAW segment-based mapping in PE_DMA, 1:1 (per-channel) and n:1 (aggregated) modes with parameterized channel count. ADR-0019: xbar/bridge removal, channel router topology with horizontal line layout, aggregated router for n:1 mode, unified NOC path for local/remote HBM access. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 01:05:27 -07:00
ywkang	63669f82cb	Add SIP-level tensor parallelism, component registry YAML, VA offset verification - DPPolicy: 3-level (sip/cube/pe), unified naming (column_wise/row_wise) - PE_CPU: auto num_programs from cube shard count - context.launch(): per-SIP KernelLaunchMsg with local va_base + auto local shape - deploy_tensor: removed mmus param, MMU mapping is context-only responsibility - ComponentRegistry: YAML-based lazy loading (components.yaml), impls→builtin rename - VA offset bench + tests: 2D/1D, standard Triton kernel pattern Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 01:13:17 -07:00
ywkang	08812eda58	Add virtual memory support: PE_MMU, VA allocator, fabric MmuMapMsg Implement VA/MMU layer (ADR-0011 Phase 1) enabling Triton kernels to use contiguous virtual addresses on sharded tensors. Key changes: - PE_MMU component: hybrid inbox (MmuMapMsg) + sync translate() for PE_DMA - VirtualAllocator + PEMemAllocator: free-list with coalescing - MmuMapMsg/MmuUnmapMsg fabric path with SIP-level routing - DPPolicy-based mapping: replicate=local, sharded=broadcast - Tensor lifecycle: del + weakref cleanup, context manager - Rename: TensorHandle.pa→addr, DmaReadCmd.src_pa→src_addr, ctx→torch Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 00:01:47 -07:00
ywkang	fc6abbc8ee	Add CHANGES.md, README, update SPEC/ADRs for release 2 - CHANGES.md: detailed changelog for release 1 and 2 - README.md: full project docs with install, probe, run, test usage - SPEC.md: add ADR-0014~0017 references, update R7 for pcie_ep endpoint - ADR-0003: update NOC description to reference ADR-0017 - ADR-0004: add HBM efficiency factor (0.8) to BW guarantee contract - ADR-0014: status Proposed -> Accepted - ADR-0015: update D4 to M_CPU bypass for Memory R/W, add ADR-0016/0017 links - ADR-0016 (new): IOChiplet NOC and memory data path - ADR-0017 (new): Cube NOC 2D mesh architecture - Fix MD lint warnings (unfenced code blocks) across all docs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 01:43:15 -07:00
ywkang	d75da439c6	Add probe CLI improvements, D2H read, UCIe/HBM tuning, BW sweep - Probe CLI: restructured output (tables first, routes below), per-hop timestamps, split cross-cube into best/worst cases, D2H read section - UCIe overhead: 1ns -> 8ns per port (16ns per crossing) to fix cross-cube-best < cross-half latency inversion - HBM efficiency: added efficiency=0.8 factor to hbm_ctrl, reducing effective BW from 256 to 204.8 GB/s - Multi-size BW sweep: saturation tables (4KB-1MB) for all probe cases - Probe default data size: 4KB -> 32KB for more realistic measurements - IOChiplet NOC + D2H topology and tests - NOC mesh, xbar, BW occupancy components and tests - Cube mesh visualization diagram 278 tests pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 01:16:18 -07:00
ywkang	6f43807900	commit - release 1	2026-03-18 11:47:48 -07:00

32 Commits
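
Commit aaa1cbfaf6 above gives the address-derived pseudo-channel (PC) striping rule as three expressions (`pc_shift`, `pc_mask`, `pc`) plus a per-flit address of `base + i*flit_bytes`. The following is a minimal Python sketch of that arithmetic only, not the repository's HBM CTRL code: the names `PcSelector` and `flit_addresses` are hypothetical, and the power-of-two check is an assumption based on the commit's mention of "power-of-2 attr validation".

```python
from dataclasses import dataclass


def _is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0


@dataclass(frozen=True)
class PcSelector:
    """Hypothetical helper mirroring the formula quoted in the commit message."""

    burst_bytes: int = 256  # default burst size named in the commit
    num_pcs: int = 8        # default pseudo-channel count named in the commit

    def __post_init__(self) -> None:
        # Assumption: both attributes must be powers of two so that the
        # shift/mask arithmetic below is exact.
        if not _is_power_of_two(self.burst_bytes):
            raise ValueError("burst_bytes must be a power of two")
        if not _is_power_of_two(self.num_pcs):
            raise ValueError("num_pcs must be a power of two")

    @property
    def pc_shift(self) -> int:
        # pc_shift = log2(burst_bytes)
        return self.burst_bytes.bit_length() - 1

    @property
    def pc_mask(self) -> int:
        # pc_mask = num_pcs - 1
        return self.num_pcs - 1

    def select(self, address: int) -> int:
        # pc = (flit.address >> pc_shift) & pc_mask
        return (address >> self.pc_shift) & self.pc_mask


def flit_addresses(base_address: int, nbytes: int, flit_bytes: int) -> list[int]:
    """Per-flit addresses: base + i * flit_bytes, one per (possibly partial) flit."""
    n_flits = -(-nbytes // flit_bytes)  # ceiling division
    return [base_address + i * flit_bytes for i in range(n_flits)]


if __name__ == "__main__":
    sel = PcSelector()
    # With burst=256 and num_pcs=8 the PC index is address bits [10:8].
    print(sel.pc_shift, sel.pc_mask)  # 8 7
    # A contiguous 2 KB transfer spreads across all eight PCs.
    print([sel.select(a) for a in flit_addresses(0, 2048, 256)])
    # Two transfers whose bases differ by 2 KB start on the same PC.
    print(sel.select(0), sel.select(2048))  # 0 0
```

Under the default configuration, bases that differ by a multiple of 2 KB land on the same PC, which matches the "PC-aligned 2KB pair collision" case that the commit's test list mentions.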