Commit Graph

  • b3ca532023 attention: milestone-gqa-llama70b figures + MILESTONE_FAST (sub-cycle 4c, 5/6) master mukesh 2026-06-01 22:23:28 -07:00
  • e748a62264 attention: land milestone-gqa-llama70b 4-panel sweep bench (ADR-0057 v1) mukesh 2026-06-01 21:57:12 -07:00
  • 222815d374 attention: add rank_axis kwarg to mesh kernels for multi_user cube ring mukesh 2026-06-01 19:53:18 -07:00
  • d9e767d048 runtime_api: ctx.launch honors DPPolicy.num_cubes + adds _auto_dim_remap opt-out mukesh 2026-06-01 19:33:40 -07:00
  • 313dee503c sim_engine: fix IPCQ slot-wrap snapshot race in Phase 2 replay mukesh 2026-06-01 19:14:09 -07:00
  • b1d6fafd3a eval: commit milestone bench output (track generated figures + results) mukesh 2026-05-22 15:37:27 -07:00
  • cc1bbd0ab7 eval: fold GEMM/allreduce harnesses into self-contained milestone benches mukesh 2026-05-22 15:19:32 -07:00
  • e33e76f2d1 adr: add INDEX.md (auto-generated by tools/generate_adr_index.py) ywkang 2026-05-22 11:15:37 -07:00
  • bd49c93703 adr: add ADR-0050-0053 — close /report's second-pass G4 candidates ywkang 2026-05-22 10:52:42 -07:00
  • 9a02955770 adr: add ADR-0046-0049 — close G4 coverage gaps from /report ywkang 2026-05-22 10:25:04 -07:00
  • 5f8dd688f5 adr: add ADR-0045 (bench module contract — registration, dispatch, authoring) ywkang 2026-05-21 16:29:45 -07:00
  • fd56b6cacd adr: add ADR-0043/0044 (eval harnesses); reconcile ADR-0024/0032 for SIP w/h sccl-distributed-allreduce mukesh 2026-05-21 10:26:25 -07:00
  • 0e346b939d gemm: test-generated GEMM plots under tests/gemm/ + docs/diagrams/gemm_plots/ mukesh 2026-05-21 09:58:08 -07:00
  • b610cb0d9a sccl: drive allreduce tests via torch.distributed; reorganize into tests/sccl/ mukesh 2026-05-20 22:24:43 -07:00
  • ff7d727ddd CCL allreduce: rename to lrab_hierarchical_allreduce + descriptive plots mukesh 2026-05-20 20:50:48 -07:00
  • e77e4a1703 types: narrow BenchResult.engine to GraphEngine, cast topology in engine_factory ywkang 2026-05-20 14:54:18 -07:00
  • 1f36baa898 ADR: add 0038-0042 (pcie_ep, pe_mmu, pe_tcm, sram, tiling) ywkang 2026-05-20 14:43:03 -07:00
  • 049e3d8bb3 benches: package as kernbench.benches, add @bench registry + list subcommand ywkang 2026-05-20 14:42:10 -07:00
  • 168b0c89f0 ADR: translate adr-ko/ to Korean, fix ADR-0013 slug, refine Status check ywkang 2026-05-20 08:17:56 -07:00
  • a796c1d2f7 ADR: bilingual structure — EN canonical in adr/, KO mirror in adr-ko/ ywkang 2026-05-20 01:38:44 -07:00
  • 687c98086d ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037 ywkang 2026-05-20 01:15:55 -07:00
  • 22fd0d2b9d ADR: introduce docs/history/, merge 0011+0018, prune migration cruft ywkang 2026-05-19 11:42:45 -07:00
  • ecc57d050d CLAUDE.md: restructure into Part 1 (general) / Part 2 (project-specific) ywkang 2026-05-18 12:08:10 -07:00
  • a7fe785e5f tl.composite: fused epilogue ops with per-op scope mukesh 2026-05-15 10:16:47 -07:00
  • a76487ca48 PE_DMA perf: SIP-wide scenarios + dual outputs + clearer naming ywkang 2026-05-15 09:43:09 -07:00
  • a143925a12 PE_DMA perf: dual-peak utilisation (single-path + aggregate) ywkang 2026-05-15 08:53:00 -07:00
  • 0bf220fed0 Switch PE_DMA perf plots to Effective BW utilization ywkang 2026-05-15 07:59:45 -07:00
  • a759d58007 Add PE_DMA latency-breakdown plots + self-verification harness ywkang 2026-05-15 01:23:42 -07:00
  • b8213d43a9 ADR-0019 D1/D4: per-PE HBM CTRL partitioning ywkang 2026-05-15 01:04:30 -07:00
  • aaa1cbfaf6 ADR-0033 D6: address-based PC selection at HBM CTRL ywkang 2026-05-15 00:18:46 -07:00
  • a44f832be5 Regenerate latency plots/diagrams for post-Phase-2c model ywkang 2026-05-14 23:24:01 -07:00
  • a0cccc71e8 Add HW architecture overview (Korean) ywkang 2026-05-14 23:23:52 -07:00
  • 32b29a1e5c ADR-0003/0014: generalize "router mesh" to "NOC" ywkang 2026-05-14 23:23:46 -07:00
  • c9bd5387ac ADR-0033 D6: reorder future work by workload impact ywkang 2026-05-14 23:21:35 -07:00
  • 9beb140eaa ADR-0033 D6: clarify what multi-flow merging actually models ywkang 2026-05-14 23:18:19 -07:00
  • c6788788a4 ADR-0033 Phase 2c-3 finish: op_log test + ADR doc reflect chunk-streaming ywkang 2026-05-14 23:12:50 -07:00
  • 6824a935c9 Calibrate 3 tests for ADR-0033 Phase 2c per-flit wire timing ywkang 2026-05-14 23:06:33 -07:00
  • 4929040cf1 Phase 2c-2/3: per-flit wire timing + flit-aware routers + HBM CTRL ywkang 2026-05-14 22:43:40 -07:00
  • b31b3e8248 Phase 2c-1: wire chunkifies into Flits + reassembly compat layer ywkang 2026-05-14 22:03:59 -07:00
  • 5fdb6f8797 Latency model: HBM PC striping + chunk-loop drain (ADR-0033) ywkang 2026-05-14 21:59:07 -07:00
  • f6d262e359 Honest measured pipeline efficiency: two timing fixes mukesh 2026-05-14 14:19:17 -07:00
  • 83ea97b05f Composite GEMM: K-loop accumulator residency, pinned operands, sweep + deck mukesh 2026-05-13 15:00:41 -07:00
  • 5accd98171 Add deck builder + overview-with-ref diagram scripts mukesh 2026-04-28 18:20:54 -07:00
  • a563169e89 Add tl.recv_no_consume diagnostic API for apples-to-apples pe2pe plot mukesh 2026-04-28 18:20:44 -07:00
  • 9c129d6131 ADR-0023 D9.7+: charge PE↔bank fabric hop for SRAM/HBM IPCQ slots mukesh 2026-04-28 18:20:28 -07:00
  • 533e699299 IPCQ-DMA co-design HW design doc + fix IPCQ slot BW model ywkang 2026-04-28 13:31:02 -07:00
  • 54fcb7e4bc Add tests/test_emit_ipcq_diagram.py (missed from earlier commit) mukesh 2026-04-27 21:42:44 -07:00
  • ad5f01ab13 Merge origin/master: combine single-cube fast path + center-root reduce mukesh 2026-04-27 21:41:46 -07:00
  • 1c5752a9ec Intercube allreduce: center root + bidirectional reduce mukesh 2026-04-27 21:28:58 -07:00
  • 84a1325e5c ADR-0023 D9.7: IPCQ slot-memory latency model (TCM/SRAM/HBM) mukesh 2026-04-27 21:28:34 -07:00
  • 1e39214f89 Move generated diagrams to docs/diagrams/; add IPCQ diagram emitter mukesh 2026-04-27 21:28:17 -07:00
  • fca24feac5 Fix all remaining test failures: single-cube allreduce + matplotlib dep ywkang 2026-04-27 21:25:31 -07:00
  • d55dc6cb4f Merge: accept remote pe2pe summary.csv ywkang 2026-04-27 17:13:06 -07:00
  • 46291bf91b PE-to-PE latency: drop h5 inter-SIP panel from overview mukesh 2026-04-27 16:43:28 -07:00
  • 04c912f53e Allreduce sweep: parametrized + xdist parallelism + topology diagram mukesh 2026-04-27 16:43:19 -07:00
  • 1c33afec55 ADR-0032 + intra_* opposite directions in IPCQ install mukesh 2026-04-27 16:43:01 -07:00
  • 81cc32c46b ADR-0001 Rev 2: 51-bit PhysAddr layout with concrete sub-unit tables ywkang 2026-04-27 15:52:29 -07:00
  • e9cc40f74d Rectangular SIP topology + 6-device allreduce sweep mukesh 2026-04-27 15:13:14 -07:00
  • c1a5cf3a2a ADR-0009 D5: chain-aware target_start_ns + zero-byte launch fanout mukesh 2026-04-27 15:12:58 -07:00
  • 90874abbfe ADR-0023 D9: blocking credit-emit with full-path latency mukesh 2026-04-27 15:12:38 -07:00
  • 19dfc86dc3 Allreduce latency sweep across topologies and data sizes mukesh 2026-04-27 10:16:29 -07:00
  • 14d800b0ae Kernel-launch sync (ADR-0009 D5) and IPCQ drain at inbound (ADR-0023) mukesh 2026-04-23 15:30:29 -07:00
  • 6918e6e906 PE-to-PE latency test + supporting fixes mukesh 2026-04-22 21:04:31 -07:00
  • 1d8b9401e5 Intercube allreduce: pe0 cube-mesh reduce + multi-SIP ring/torus/mesh kernbench_Apr16 mukesh 2026-04-16 17:33:42 -07:00
  • cfc2d74ec4 Refactor ccl_allreduce bench: rank=SIP only, remove rank=PE legacy path ywkang 2026-04-14 16:45:27 -07:00
  • 105f1dc09e ADR-0027: Megatron TP API + worker-wait generalization + mp.spawn ywkang 2026-04-14 16:31:13 -07:00
  • e7f376ebaa ADR-0027 rev7 (Megatron TP + worker-wait generalization) + ADR-0026 typo fix ywkang 2026-04-14 14:13:26 -07:00
  • 357cab525b ADR-0026: DPPolicy intra-device only + ShardSpec structural coords ywkang 2026-04-14 13:02:19 -07:00
  • 787409ced1 ADR-0024 Phase B: update xfail reason with architectural blocker details ywkang 2026-04-14 12:46:33 -07:00
  • 79124daab1 ADR-0024 Phase B (partial): scheduler-level collective drain ywkang 2026-04-14 09:14:03 -07:00
  • 4ba0a83e71 Implement ADR-0024 Phase A: SIP-level TP launcher MVP ywkang 2026-04-14 09:00:28 -07:00
  • 32536daf2e Fix ADR-0025: IPCQ direction addressing via address-based matching ywkang 2026-04-14 00:38:41 -07:00
  • e1084800ab docs: add ADRs 0024–0031 for SIP-TP launcher stack ywkang 2026-04-14 00:38:27 -07:00
  • b2c52f0e34 Add English translations for ADR-0018, 0019, 0020, 0021 ywkang 2026-04-13 16:31:32 -07:00
  • 10b33b44ba Add Tensor indexing + hierarchical 3-level all-reduce kernel ywkang 2026-04-12 23:52:04 -07:00
  • 1c8ddc2d03 Fix Phase 1 slot-overwrite race + PE_MATH latency model (n_slots=4 safe) ywkang 2026-04-12 23:02:19 -07:00
  • 74f5f5cf08 Add session-scoped topology fixture in tests/conftest.py ywkang 2026-04-12 21:13:25 -07:00
  • 372c987995 Reduce test time to 12s: shrink GEMM dims + enable pytest-xdist ywkang 2026-04-12 21:06:41 -07:00
  • bcf941dcee Speed up regression: 25min → 6min (test matrix + DataExecutor cleanup) ywkang 2026-04-12 20:52:07 -07:00
  • 998cc85762 Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023) ywkang 2026-04-12 19:36:59 -07:00
  • ff2c677a9c Add 2D grid program_id semantics (ADR-0022) ywkang 2026-04-09 16:49:56 -07:00
  • dc3fb02aed Add --verify-data CLI flag, Tensor.data property, parallel DataExecutor ywkang 2026-04-09 09:34:01 -07:00
  • 59e36f0c34 Add E2E pipeline tests: greenlet op_log, GEMM accuracy, latency regression ywkang 2026-04-09 00:28:03 -07:00
  • 81ce55571d Rename impl names: add builtin. prefix for clear provenance ywkang 2026-04-09 00:16:24 -07:00
  • 1d95df4bee Restructure legacy backups, remove pe_accel, fix DMA self-routing ywkang 2026-04-09 00:02:26 -07:00
  • 95d583ef9f Add Phase 1→Phase 2 e2e data tests + GraphEngine enable_data mode ywkang 2026-04-08 23:49:28 -07:00
  • f5d1606f9d Add ADR-0021 pipeline tests: self-routing, tiling, overlap ywkang 2026-04-08 23:40:19 -07:00
  • b6eb97c49a Implement ADR-0021: PE pipeline refactor with token self-routing ywkang 2026-04-08 23:35:31 -07:00
  • 161132cdcb ADR-0021: PE pipeline refactor — component separation + token self-routing ywkang 2026-04-08 23:21:40 -07:00
  • 51004c311c Implement ADR-0020: 2-pass data execution with greenlet kernel runner ywkang 2026-04-08 00:22:44 -07:00
  • 140b85436a ADR-0020: 2-Pass data execution model with greenlet kernel runner ywkang 2026-04-07 23:53:49 -07:00
  • eb792e6212 Remove xbar/noc remnants, rule-based cube-view connectors ywkang 2026-04-06 23:59:12 -07:00
  • 7640635f90 M_CPU/SRAM placement via pos_mm in topology.yaml (nearest router) ywkang 2026-04-05 00:48:20 -07:00
  • 3ea4fa90f8 Cube-view: increase 45° stub length and component gap for visibility ywkang 2026-04-05 00:38:27 -07:00
  • 5125d92c17 Cube-view: M_CPU north, 45° stub-straight-stub connector pattern ywkang 2026-04-05 00:34:48 -07:00
  • 72acc5c8bb Cube-view: UCIe flush against cube edges ywkang 2026-04-05 00:28:58 -07:00
  • bde76ec959 Cube-view: 45° diagonal from router, then straight to component ywkang 2026-04-05 00:25:41 -07:00
  • d3de982ea4 Cube-view: 90° router mesh links, 45° component connectors ywkang 2026-04-05 00:20:28 -07:00
  • df81835d84 Cube-view: UCIe position/size from topology.yaml (ucie_mm.size=2.0) ywkang 2026-04-05 00:11:11 -07:00
  • 66ec6cd40c Cube-view: UCIe components inside cube boundary with port boxes ywkang 2026-04-04 23:58:32 -07:00