Joined on 2026-04-13
mukesh pushed to master at ywkang/kernbench2 2026-06-02 05:23:35 +00:00
b3ca532023 attention: milestone-gqa-llama70b figures + MILESTONE_FAST (sub-cycle 4c, 5/6)
e748a62264 attention: land milestone-gqa-llama70b 4-panel sweep bench (ADR-0057 v1)
mukesh pushed to master at ywkang/kernbench2 2026-06-02 02:53:20 +00:00
222815d374 attention: add rank_axis kwarg to mesh kernels for multi_user cube ring
mukesh pushed to master at ywkang/kernbench2 2026-06-02 02:33:43 +00:00
d9e767d048 runtime_api: ctx.launch honors DPPolicy.num_cubes + adds _auto_dim_remap opt-out
mukesh pushed to master at ywkang/kernbench2 2026-06-02 02:15:16 +00:00
313dee503c sim_engine: fix IPCQ slot-wrap snapshot race in Phase 2 replay
mukesh pushed to master at ywkang/kernbench2 2026-05-22 22:37:31 +00:00
b1d6fafd3a eval: commit milestone bench output (track generated figures + results)
mukesh pushed to master at ywkang/kernbench2 2026-05-22 22:32:20 +00:00
cc1bbd0ab7 eval: fold GEMM/allreduce harnesses into self-contained milestone benches
mukesh pushed to master at ywkang/kernbench2 2026-05-21 18:07:48 +00:00
fd56b6cacd adr: add ADR-0043/0044 (eval harnesses); reconcile ADR-0024/0032 for SIP w/h
0e346b939d gemm: test-generated GEMM plots under tests/gemm/ + docs/diagrams/gemm_plots/
b610cb0d9a sccl: drive allreduce tests via torch.distributed; reorganize into tests/sccl/
mukesh pushed to sccl-distributed-allreduce at ywkang/kernbench2 2026-05-21 17:27:28 +00:00
fd56b6cacd adr: add ADR-0043/0044 (eval harnesses); reconcile ADR-0024/0032 for SIP w/h
mukesh pushed to sccl-distributed-allreduce at ywkang/kernbench2 2026-05-21 16:58:56 +00:00
0e346b939d gemm: test-generated GEMM plots under tests/gemm/ + docs/diagrams/gemm_plots/
mukesh created branch sccl-distributed-allreduce in ywkang/kernbench2 2026-05-21 05:30:43 +00:00
mukesh pushed to sccl-distributed-allreduce at ywkang/kernbench2 2026-05-21 05:30:43 +00:00
b610cb0d9a sccl: drive allreduce tests via torch.distributed; reorganize into tests/sccl/
mukesh pushed to master at ywkang/kernbench2 2026-05-21 03:52:00 +00:00
ff7d727ddd CCL allreduce: rename to lrab_hierarchical_allreduce + descriptive plots
mukesh pushed to master at ywkang/kernbench2 2026-05-15 17:17:30 +00:00
a7fe785e5f tl.composite: fused epilogue ops with per-op scope
mukesh pushed to master at ywkang/kernbench2 2026-05-14 21:20:06 +00:00
f6d262e359 Honest measured pipeline efficiency: two timing fixes
mukesh pushed to master at ywkang/kernbench2 2026-05-13 22:03:13 +00:00
83ea97b05f Composite GEMM: K-loop accumulator residency, pinned operands, sweep + deck
mukesh pushed to master at ywkang/kernbench2 2026-04-29 01:21:48 +00:00
5accd98171 Add deck builder + overview-with-ref diagram scripts
a563169e89 Add tl.recv_no_consume diagnostic API for apples-to-apples pe2pe plot
9c129d6131 ADR-0023 D9.7+: charge PE↔bank fabric hop for SRAM/HBM IPCQ slots
mukesh pushed to master at ywkang/kernbench2 2026-04-28 04:43:19 +00:00
54fcb7e4bc Add tests/test_emit_ipcq_diagram.py (missed from earlier commit)
ad5f01ab13 Merge origin/master: combine single-cube fast path + center-root reduce
1c5752a9ec Intercube allreduce: center root + bidirectional reduce
84a1325e5c ADR-0023 D9.7: IPCQ slot-memory latency model (TCM/SRAM/HBM)
1e39214f89 Move generated diagrams to docs/diagrams/; add IPCQ diagram emitter
mukesh pushed to master at ywkang/kernbench2 2026-04-27 23:43:57 +00:00
46291bf91b PE-to-PE latency: drop h5 inter-SIP panel from overview
04c912f53e Allreduce sweep: parametrized + xdist parallelism + topology diagram
1c33afec55 ADR-0032 + intra_* opposite directions in IPCQ install
mukesh pushed to master at ywkang/kernbench2 2026-04-27 22:13:37 +00:00
e9cc40f74d Rectangular SIP topology + 6-device allreduce sweep
c1a5cf3a2a ADR-0009 D5: chain-aware target_start_ns + zero-byte launch fanout
90874abbfe ADR-0023 D9: blocking credit-emit with full-path latency
mukesh pushed to master at ywkang/kernbench2 2026-04-27 17:16:47 +00:00
19dfc86dc3 Allreduce latency sweep across topologies and data sizes