mukesh
90874abbfe
ADR-0023 D9: blocking credit-emit with full-path latency
...
PE_IPCQ._handle_recv now yields-from _delayed_credit_send instead of
spawning it as a fork, so the receiver's pe_exec_ns includes the
credit-return cost. _credit_latency_ns switches from
compute_drain_ns(path, 16) to compute_path_latency_ns(path, 16) and
fixes a latent find_path bug where the destination lacked the
".pe_dma" suffix (silently returned 0 ns under the bare except).
Net effect on h3/h4 inter-cube pe-to-pe latency: IPCQ >= raw DMA at
every size, matching real-HW posted-write semantics. tl.send remains
fire-and-forget. ADR-0023 D9 amended; new diagnostic test
tests/test_pe_to_pe_diagnostic.py captures per-PE pe_exec_ns, paths,
drain, and meta-arrival timing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-27 15:12:38 -07:00
mukesh
14d800b0ae
Kernel-launch sync (ADR-0009 D5) and IPCQ drain at inbound (ADR-0023)
...
- KernelLaunchMsg gains target_start_ns: IO_CPU stamps a global barrier
(max path latency across every target PE), M_CPU passes it through,
PE_CPU yields until it before recording pe_exec_start. Every PE in a
launch begins kernel execution at the same env.now regardless of its
dispatch path length — eliminates per-PE dispatch-offset artifact in
cross-PE and cross-cube latency measurements.
- PE_DMA._handle_ipcq_inbound now pays Transaction.drain_ns at the top,
matching the terminal-drain behavior of ComponentBase._forward_txn for
every non-IPCQ Transaction. SRC-side tl.send stays fire-and-forget
(sender doesn't yield on sub_done); tl.recv now blocks until bytes
have actually drained into its inbox.
- ComponentContext: new compute_path_latency_ns helper + node_overhead_ns
field populated by GraphEngine.
- tests/test_kernel_launch_sync.py: asserts all PEs in one launch
produce identical pe_exec_ns for a no-op kernel (zero spread).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-23 15:30:29 -07:00
ywkang
e7f376ebaa
ADR-0027 rev7 (Megatron TP + worker-wait generalization) + ADR-0026 typo fix
...
ADR-0027 is a design-only change (no production code). Rev 7 closes design
across 7 iterations of review. Key decisions:
- D0 (worker-wait generalization): ctx.wait in worker context yields to
main scheduler, which drains env.run. Solves ADR-0024 Phase B orphan
bug (ring_default_ws strict xfail). Normative contracts on resume
invariant, fast-path, main-context non-reentrance, barrier
loop-until-empty, and scheduler non-progress as user contract.
- D0.5 (host-read barrier): Tensor.numpy/data/__getitem__/__repr__/copy_
auto-drain pending before reading. Closed-set via explicit registry
(T5.g). copy_ uses global-pending barrier with explicit
over-serialization tradeoff.
- D1 (torch.multiprocessing.spawn): real-PyTorch API-signature parity,
cooperative greenlet scheduler internally. Explicit non-goal on
process isolation / address space / failure isolation. Sibling
cleanup via SystemExit + SpawnException(errors) wrapping root-cause
ranks.
- D4/D5 (TP layers): ColumnParallelLinear / RowParallelLinear use
torch.launch(gemm_kernel) — no host-side torch.matmul. Yield-safety
contract normatively required for all TP forward paths.
- Supersedes ADR-0024 D7/D12/D13 as design (none landed). Source of
truth declared normative.
Test strategy: T1-T8 with numerical-correctness primary (not mean/
aggregate-only), orphan invariant direct assertion, host-read barrier
closed-set via registry. Phase 2 acceptance = 524 passed + 0 xfail
(ring_default_ws unblocked by D0).
ADR-0026 typo fix: torch.cuda.set_device → torch.ahbm.set_device in
DPPolicy docstring (ADR-0024 D10 convention).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-14 14:13:26 -07:00
ywkang
357cab525b
ADR-0026: DPPolicy intra-device only + ShardSpec structural coords
...
DPPolicy no longer carries a cross-SIP axis. SIP-level placement is
solely controlled by torch.ahbm.set_device(rank) (ADR-0024); DPPolicy
itself describes only the cube × PE layout within one SIP. ShardSpec
switches to structural (sip, cube, pe) coordinates; the flat pe_index
field/property is fully removed — silent drift between global-flat and
SIP-local interpretations was a foot-gun flagged by ADR-0024 D11.
Breaking API (explicit TypeError / AttributeError):
- DPPolicy(sip=...) / DPPolicy(num_sips=...) -> TypeError
- ShardSpec.pe_index -> AttributeError
- ShardSpec(pe_index=...) -> TypeError
- resolve_dp_policy now takes target_sip= (required), no num_sips.
Downstream migration:
- PE allocator dict keyed by (sip, cube, pe) tuples, in both
_ensure_allocators and _free_tensor. deploy_tensor uses tuple lookup.
- _create_tensor passes target_sip=current_sip; post-hoc pe_index
shifting removed entirely.
- launch._compute_local_shape drops the dp.sip branch.
- Internal resolvers (column_wise / row_wise / replicate / tiled_*)
return _LocalPeShard (cube-local identifier) instead of ShardSpec —
resolve_dp_policy lifts them to full structural coords.
Tests:
- New tests/test_adr0026_dppolicy_intra_device.py (12 tests) pins the
contract end-to-end.
- test_sip_parallel.py rewritten: SIP composition now modeled as two
resolve_dp_policy(target_sip=...) calls (ADR-0024 launcher style).
- Call-site migration: test_tensor, test_va_integration, test_va_offset,
test_runtime_api_tensor, test_tl_recv_async, test_ccl_* and benches
gemm_single_pe, gpt3_qkv, va_offset_verify, ccl_allreduce (legacy
branch) all use intra-device DPPolicy and structural ShardSpec.
Result: 523 passed, 1 strict xfail (ring_default_ws — unchanged
ADR-0024 Phase B blocker; architectural fix deferred to ADR-0027).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-14 13:02:19 -07:00
ywkang
e1084800ab
docs: add ADRs 0024–0031 for SIP-TP launcher stack
...
ADR-0024 (SIP-level TP launcher): rank = SIP abstraction, engine-routed
install, mp.spawn parity, epoch barrier, ShardSpec structural coords.
ADR-0025 (IPCQ direction addressing): address-based matching for meta
arrival and credit return; fixes 2-rank bidirectional ring deadlock.
ADR-0026 (DPPolicy intra-device only): remove sip/num_sips fields;
ShardSpec uses structural (sip, cube, pe); pe_index property removed.
ADR-0027 (Megatron-style TP API): ColumnParallelLinear / RowParallelLinear
on top of ADR-0024 launcher. Backlog until 0024/0025/0026 land.
ADR-0028 (DTensor support): stub / future work.
ADR-0029 (Hierarchical all-reduce): 3-level reduce using all_pes mapper
and multi_pe_sip_local validator from ADR-0024. Backlog.
ADR-0030 (IPCQ PhysAddr integration): blocked on ADR-0031.
ADR-0031 (PhysAddr PE-resource extension): stub; local_offset range-based
partition approach; specific ranges TBD.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-14 00:38:27 -07:00
ywkang
b2c52f0e34
Add English translations for ADR-0018, 0019, 0020, 0021
...
- ADR-0018: LA-based memory address abstraction + BAAW + HBM channel mapping
- ADR-0019: CUBE NOC per-channel and aggregated HBM connection model
- ADR-0020: 2-pass data execution model (timing/data separation, greenlet)
- ADR-0021: PE pipeline refactor (component separation + token self-routing)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-13 16:31:32 -07:00
ywkang
998cc85762
Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)
...
Major changes:
PE-level IPCQ infrastructure:
- New PE_IPCQ component: ring-buffer control plane with 4-direction
neighbor mapping, head/tail pointers, backpressure (poll/sleep).
- PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA,
including in-flight data snapshot (D9) and op_log recording at
outbound time for Phase 2 replay correctness.
- IpcqDmaToken piggyback model: data + metadata travel together,
atomic visibility at receiver (invariant I6).
- Credit return fast path: bottleneck-BW latency, no fabric vc_comm.
Phase 2 data execution (ADR-0020 integration):
- op_log extended: DmaWriteCmd now captures src_space/src_addr for
Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time.
- DataExecutor replays dma_write + ipcq_copy in t_start order.
- Engine._flush_data_phase: incremental cursor-based replay after
each engine.wait() so host reads see post-Phase-2 data.
- KernelRunner Phase 1 writes disabled when op_log is active to
prevent stale data from corrupting the MemoryStore snapshot.
TLContext / kernel API:
- tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype),
tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode.
- TensorHandle operator overloading (add/sub/mul/div) via thread-local
active TLContext → MathCmd dispatch through PE_MATH.
- PE-local scratch allocator for math output handles.
- tl.load returns space="hbm" handles for correct Phase 2 addressing.
- Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv.
Unified ccl_allreduce bench (PyTorch-compat host code):
- Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch)
split matching real PyTorch DDP worker pattern.
- torch.distributed facade: init_process_group, get_world_size,
get_rank, get_backend, all_reduce, barrier — only real PyTorch names.
- AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches
kernel via tensor shard metadata (n_elem from shards[0].nbytes).
- world_size derived from topology spec (sips × cubes × pes_per_cube)
with optional algorithm-level override in ccl.yaml.
Tensor API (PyTorch-compat surface):
- Tensor.numpy(): gather-aware (all shards via VA-based addressing).
- Tensor.copy_(source): scatter from host tensor into sharded target.
- RuntimeContext.from_numpy(arr): host-side staging tensor.
- Tensor.data property fixed to use numpy() (was shards[0]-only).
Algorithm modules moved to src/kernbench/ccl/algorithms/:
- ring_allreduce, mesh_allreduce, tree_allreduce, hello_send.
- Each module exports kernel_args(world_size, n_elem) helper.
- ccl.yaml module paths updated to kernbench.ccl.algorithms.*.
Dead code removed:
- 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.).
- _run_ccl_bench greenlet-per-SIP scheduler.
- benches.loader.is_ccl_bench + run_rank detection.
- benches/ccl/ directory.
Tests:
- New test_ccl_allreduce_matrix.py: 7 parametrized cases
(ring×3 buffers, ring 8/16, mesh 4, tree 7).
- New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests.
- Existing tests updated for new import paths + world_size_override.
Docs:
- Korean ccl-author-guide.md and ADR-0023 paths updated.
- New English versions: ccl-author-guide.en.md, ADR-0023.en.md.
502 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-12 19:36:59 -07:00
ywkang
ff2c677a9c
Add 2D grid program_id semantics (ADR-0022)
...
tl.program_id(axis=0) returns local PE id within cube,
tl.program_id(axis=1) returns cube id. Enables cube-aware
sharding in benchmark kernels.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-09 16:49:56 -07:00
ywkang
161132cdcb
ADR-0021: PE pipeline refactor — component separation + token self-routing
...
Design for refactoring pe_accel monolith into independent builtin components:
- D1: 6 independent components (scheduler, DMA, fetch_store, GEMM, MATH, TCM)
- D2: Token self-routing — scheduler only dispatches + tracks completion
- D3: done signal = simpy.Event (HW wire), data = message (queue)
- D4: Async pipeline with single FIFO feeder, command-level ordering
- D5: PE_FETCH_STORE separates TCM↔register from compute
- D6: Compute components implement _process() only, chaining in base
- D7: Topology adds pe_fetch_store + chaining edges
- D8: Existing builtin/pe_accel → builtin_legacy backup, new builtin
- D9: TileToken with plan + stage_idx for self-routing
Key decisions from review:
- No PipelineManager object — scheduler + existing ports sufficient
- PipelineContext with exactly-once completion contract
- _feed_loop singleton per scheduler, FIFO command ordering
- Intra-PE chaining: no explicit latency model
- Latency models ported from pe_accel current implementation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-08 23:21:40 -07:00
ywkang
140b85436a
ADR-0020: 2-Pass data execution model with greenlet kernel runner
...
Design for actual data storage/computation in HBM/TCM/SRAM components:
- Phase 1: SimPy timing + MemoryStore (memory ops data-aware via greenlet)
- Phase 2: op_log-based numpy execution for GEMM/Math verification
- Greenlet-based KernelRunner replaces Phase 0 command list generation
- tl.load() returns real data in Phase 1, enabling memory-based control flow
- ComponentBase hook for op logging (single source of truth)
- MemoryStore: numpy ndarray tensor-granular storage with reference semantics
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-07 23:53:49 -07:00
ywkang
5917b3497c
Replace xbar/bridge/single-NOC with explicit router mesh (ADR-0019)
...
- Remove xbar_top/bot, bridge, single noc node from topology
- Each cube_mesh.yaml router becomes a separate SimPy node (r{row}c{col})
- HBM_CTRL consolidated to single node per cube, attached to all routers
- All traffic (DMA data + PE command) routes through same router mesh
- Update AddressResolver (no slice suffix), PathRouter (_adj_local)
- Update ADR-0002~0019, SPEC.md to remove xbar/bridge references
- Regenerate SVG diagrams for new topology structure
- Skip cross-SIP PE_TCM and PE_MMU routing tests (not yet wired)
326 passed, 13 skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-04 17:51:28 -07:00
ywkang
31c7110da7
Add ADR-0018 (LA/BAAW addressing) and ADR-0019 (NOC per-channel HBM)
...
ADR-0018: LA replaces VA, BAAW segment-based mapping in PE_DMA,
1:1 (per-channel) and n:1 (aggregated) modes with parameterized
channel count.
ADR-0019: xbar/bridge removal, channel router topology with
horizontal line layout, aggregated router for n:1 mode,
unified NOC path for local/remote HBM access.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-03-27 01:05:27 -07:00
ywkang
63669f82cb
Add SIP-level tensor parallelism, component registry YAML, VA offset verification
...
- DPPolicy: 3-level (sip/cube/pe), unified naming (column_wise/row_wise)
- PE_CPU: auto num_programs from cube shard count
- context.launch(): per-SIP KernelLaunchMsg with local va_base + auto local shape
- deploy_tensor: removed mmus param, MMU mapping is context-only responsibility
- ComponentRegistry: YAML-based lazy loading (components.yaml), impls→builtin rename
- VA offset bench + tests: 2D/1D, standard Triton kernel pattern
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-03-26 01:13:17 -07:00
ywkang
08812eda58
Add virtual memory support: PE_MMU, VA allocator, fabric MmuMapMsg
...
Implement VA/MMU layer (ADR-0011 Phase 1) enabling Triton kernels to use
contiguous virtual addresses on sharded tensors.
Key changes:
- PE_MMU component: hybrid inbox (MmuMapMsg) + sync translate() for PE_DMA
- VirtualAllocator + PEMemAllocator: free-list with coalescing
- MmuMapMsg/MmuUnmapMsg fabric path with SIP-level routing
- DPPolicy-based mapping: replicate=local, sharded=broadcast
- Tensor lifecycle: del + weakref cleanup, context manager
- Rename: TensorHandle.pa→addr, DmaReadCmd.src_pa→src_addr, ctx→torch
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-03-26 00:01:47 -07:00
ywkang
fc6abbc8ee
Add CHANGES.md, README, update SPEC/ADRs for release 2
...
- CHANGES.md: detailed changelog for release 1 and 2
- README.md: full project docs with install, probe, run, test usage
- SPEC.md: add ADR-0014~0017 references, update R7 for pcie_ep endpoint
- ADR-0003: update NOC description to reference ADR-0017
- ADR-0004: add HBM efficiency factor (0.8) to BW guarantee contract
- ADR-0014: status Proposed -> Accepted
- ADR-0015: update D4 to M_CPU bypass for Memory R/W, add ADR-0016/0017 links
- ADR-0016 (new): IOChiplet NOC and memory data path
- ADR-0017 (new): Cube NOC 2D mesh architecture
- Fix MD lint warnings (unfenced code blocks) across all docs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-19 01:43:15 -07:00
ywkang
d75da439c6
Add probe CLI improvements, D2H read, UCIe/HBM tuning, BW sweep
...
- Probe CLI: restructured output (tables first, routes below), per-hop
timestamps, split cross-cube into best/worst cases, D2H read section
- UCIe overhead: 1ns -> 8ns per port (16ns per crossing) to fix
cross-cube-best < cross-half latency inversion
- HBM efficiency: added efficiency=0.8 factor to hbm_ctrl, reducing
effective BW from 256 to 204.8 GB/s
- Multi-size BW sweep: saturation tables (4KB-1MB) for all probe cases
- Probe default data size: 4KB -> 32KB for more realistic measurements
- IOChiplet NOC + D2H topology and tests
- NOC mesh, xbar, BW occupancy components and tests
- Cube mesh visualization diagram
278 tests pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-19 01:16:18 -07:00
ywkang
6f43807900
commit - release 1
2026-03-18 11:47:48 -07:00