-
b3ca532023
attention: milestone-gqa-llama70b figures + MILESTONE_FAST (sub-cycle 4c, 5/6)
master
mukesh
2026-06-01 22:23:28 -07:00
-
e748a62264
attention: land milestone-gqa-llama70b 4-panel sweep bench (ADR-0057 v1)
mukesh
2026-06-01 21:57:12 -07:00
-
222815d374
attention: add rank_axis kwarg to mesh kernels for multi_user cube ring
mukesh
2026-06-01 19:53:18 -07:00
-
d9e767d048
runtime_api: ctx.launch honors DPPolicy.num_cubes + adds _auto_dim_remap opt-out
mukesh
2026-06-01 19:33:40 -07:00
-
313dee503c
sim_engine: fix IPCQ slot-wrap snapshot race in Phase 2 replay
mukesh
2026-06-01 19:14:09 -07:00
-
b1d6fafd3a
eval: commit milestone bench output (track generated figures + results)
mukesh
2026-05-22 15:37:27 -07:00
-
cc1bbd0ab7
eval: fold GEMM/allreduce harnesses into self-contained milestone benches
mukesh
2026-05-22 15:19:32 -07:00
-
e33e76f2d1
adr: add INDEX.md (auto-generated by tools/generate_adr_index.py)
ywkang
2026-05-22 11:15:37 -07:00
-
bd49c93703
adr: add ADR-0050-0053 — close /report's second-pass G4 candidates
ywkang
2026-05-22 10:52:42 -07:00
-
9a02955770
adr: add ADR-0046-0049 — close G4 coverage gaps from /report
ywkang
2026-05-22 10:25:04 -07:00
-
5f8dd688f5
adr: add ADR-0045 (bench module contract — registration, dispatch, authoring)
ywkang
2026-05-21 16:29:45 -07:00
-
fd56b6cacd
adr: add ADR-0043/0044 (eval harnesses); reconcile ADR-0024/0032 for SIP w/h
sccl-distributed-allreduce
mukesh
2026-05-21 10:26:25 -07:00
-
0e346b939d
gemm: test-generated GEMM plots under tests/gemm/ + docs/diagrams/gemm_plots/
mukesh
2026-05-21 09:58:08 -07:00
-
b610cb0d9a
sccl: drive allreduce tests via torch.distributed; reorganize into tests/sccl/
mukesh
2026-05-20 22:24:43 -07:00
-
ff7d727ddd
CCL allreduce: rename to lrab_hierarchical_allreduce + descriptive plots
mukesh
2026-05-20 20:50:48 -07:00
-
e77e4a1703
types: narrow BenchResult.engine to GraphEngine, cast topology in engine_factory
ywkang
2026-05-20 14:54:18 -07:00
-
1f36baa898
ADR: add 0038-0042 (pcie_ep, pe_mmu, pe_tcm, sram, tiling)
ywkang
2026-05-20 14:43:03 -07:00
-
049e3d8bb3
benches: package as kernbench.benches, add @bench registry + list subcommand
ywkang
2026-05-20 14:42:10 -07:00
-
168b0c89f0
ADR: translate adr-ko/ to Korean, fix ADR-0013 slug, refine Status check
ywkang
2026-05-20 08:17:56 -07:00
-
a796c1d2f7
ADR: bilingual structure — EN canonical in adr/, KO mirror in adr-ko/
ywkang
2026-05-20 01:38:44 -07:00
-
687c98086d
ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037
ywkang
2026-05-20 01:15:55 -07:00
-
22fd0d2b9d
ADR: introduce docs/history/, merge 0011+0018, prune migration cruft
ywkang
2026-05-19 11:42:45 -07:00
-
ecc57d050d
CLAUDE.md: restructure into Part 1 (general) / Part 2 (project-specific)
ywkang
2026-05-18 12:08:10 -07:00
-
a7fe785e5f
tl.composite: fused epilogue ops with per-op scope
mukesh
2026-05-15 10:16:47 -07:00
-
a76487ca48
PE_DMA perf: SIP-wide scenarios + dual outputs + clearer naming
ywkang
2026-05-15 09:43:09 -07:00
-
a143925a12
PE_DMA perf: dual-peak utilisation (single-path + aggregate)
ywkang
2026-05-15 08:53:00 -07:00
-
0bf220fed0
Switch PE_DMA perf plots to Effective BW utilization
ywkang
2026-05-15 07:59:45 -07:00
-
a759d58007
Add PE_DMA latency-breakdown plots + self-verification harness
ywkang
2026-05-15 01:23:42 -07:00
-
b8213d43a9
ADR-0019 D1/D4: per-PE HBM CTRL partitioning
ywkang
2026-05-15 01:04:30 -07:00
-
aaa1cbfaf6
ADR-0033 D6: address-based PC selection at HBM CTRL
ywkang
2026-05-15 00:18:46 -07:00
-
a44f832be5
Regenerate latency plots/diagrams for post-Phase-2c model
ywkang
2026-05-14 23:24:01 -07:00
-
a0cccc71e8
Add HW architecture overview (Korean)
ywkang
2026-05-14 23:23:52 -07:00
-
32b29a1e5c
ADR-0003/0014: generalize "router mesh" to "NOC"
ywkang
2026-05-14 23:23:46 -07:00
-
c9bd5387ac
ADR-0033 D6: reorder future work by workload impact
ywkang
2026-05-14 23:21:35 -07:00
-
9beb140eaa
ADR-0033 D6: clarify what multi-flow merging actually models
ywkang
2026-05-14 23:18:19 -07:00
-
c6788788a4
ADR-0033 Phase 2c-3 finish: op_log test + ADR doc reflect chunk-streaming
ywkang
2026-05-14 23:12:50 -07:00
-
6824a935c9
Calibrate 3 tests for ADR-0033 Phase 2c per-flit wire timing
ywkang
2026-05-14 23:06:33 -07:00
-
4929040cf1
Phase 2c-2/3: per-flit wire timing + flit-aware routers + HBM CTRL
ywkang
2026-05-14 22:43:40 -07:00
-
b31b3e8248
Phase 2c-1: wire chunkifies into Flits + reassembly compat layer
ywkang
2026-05-14 22:03:59 -07:00
-
5fdb6f8797
Latency model: HBM PC striping + chunk-loop drain (ADR-0033)
ywkang
2026-05-14 21:59:07 -07:00
-
f6d262e359
Honest measured pipeline efficiency: two timing fixes
mukesh
2026-05-14 14:19:17 -07:00
-
83ea97b05f
Composite GEMM: K-loop accumulator residency, pinned operands, sweep + deck
mukesh
2026-05-13 15:00:41 -07:00
-
5accd98171
Add deck builder + overview-with-ref diagram scripts
mukesh
2026-04-28 18:20:54 -07:00
-
a563169e89
Add tl.recv_no_consume diagnostic API for apples-to-apples pe2pe plot
mukesh
2026-04-28 18:20:44 -07:00
-
9c129d6131
ADR-0023 D9.7+: charge PE↔bank fabric hop for SRAM/HBM IPCQ slots
mukesh
2026-04-28 18:20:28 -07:00
-
533e699299
IPCQ-DMA co-design HW design doc + fix IPCQ slot BW model
ywkang
2026-04-28 13:31:02 -07:00
-
54fcb7e4bc
Add tests/test_emit_ipcq_diagram.py (missed from earlier commit)
mukesh
2026-04-27 21:42:44 -07:00
-
ad5f01ab13
Merge origin/master: combine single-cube fast path + center-root reduce
mukesh
2026-04-27 21:41:46 -07:00
-
1c5752a9ec
Intercube allreduce: center root + bidirectional reduce
mukesh
2026-04-27 21:28:58 -07:00
-
84a1325e5c
ADR-0023 D9.7: IPCQ slot-memory latency model (TCM/SRAM/HBM)
mukesh
2026-04-27 21:28:34 -07:00
-
1e39214f89
Move generated diagrams to docs/diagrams/; add IPCQ diagram emitter
mukesh
2026-04-27 21:28:17 -07:00
-
fca24feac5
Fix all remaining test failures: single-cube allreduce + matplotlib dep
ywkang
2026-04-27 21:25:31 -07:00
-
d55dc6cb4f
Merge: accept remote pe2pe summary.csv
ywkang
2026-04-27 17:13:06 -07:00
-
46291bf91b
PE-to-PE latency: drop h5 inter-SIP panel from overview
mukesh
2026-04-27 16:43:28 -07:00
-
04c912f53e
Allreduce sweep: parametrized + xdist parallelism + topology diagram
mukesh
2026-04-27 16:43:19 -07:00
-
1c33afec55
ADR-0032 + intra_* opposite directions in IPCQ install
mukesh
2026-04-27 16:43:01 -07:00
-
81cc32c46b
ADR-0001 Rev 2: 51-bit PhysAddr layout with concrete sub-unit tables
ywkang
2026-04-27 15:52:29 -07:00
-
e9cc40f74d
Rectangular SIP topology + 6-device allreduce sweep
mukesh
2026-04-27 15:13:14 -07:00
-
c1a5cf3a2a
ADR-0009 D5: chain-aware target_start_ns + zero-byte launch fanout
mukesh
2026-04-27 15:12:58 -07:00
-
90874abbfe
ADR-0023 D9: blocking credit-emit with full-path latency
mukesh
2026-04-27 15:12:38 -07:00
-
19dfc86dc3
Allreduce latency sweep across topologies and data sizes
mukesh
2026-04-27 10:16:29 -07:00
-
14d800b0ae
Kernel-launch sync (ADR-0009 D5) and IPCQ drain at inbound (ADR-0023)
mukesh
2026-04-23 15:30:29 -07:00
-
6918e6e906
PE-to-PE latency test + supporting fixes
mukesh
2026-04-22 21:04:31 -07:00
-
1d8b9401e5
Intercube allreduce: pe0 cube-mesh reduce + multi-SIP ring/torus/mesh
kernbench_Apr16
mukesh
2026-04-16 17:33:42 -07:00
-
cfc2d74ec4
Refactor ccl_allreduce bench: rank=SIP only, remove rank=PE legacy path
ywkang
2026-04-14 16:45:27 -07:00
-
105f1dc09e
ADR-0027: Megatron TP API + worker-wait generalization + mp.spawn
ywkang
2026-04-14 16:31:13 -07:00
-
e7f376ebaa
ADR-0027 rev7 (Megatron TP + worker-wait generalization) + ADR-0026 typo fix
ywkang
2026-04-14 14:13:26 -07:00
-
357cab525b
ADR-0026: DPPolicy intra-device only + ShardSpec structural coords
ywkang
2026-04-14 13:02:19 -07:00
-
787409ced1
ADR-0024 Phase B: update xfail reason with architectural blocker details
ywkang
2026-04-14 12:46:33 -07:00
-
79124daab1
ADR-0024 Phase B (partial): scheduler-level collective drain
ywkang
2026-04-14 09:14:03 -07:00
-
4ba0a83e71
Implement ADR-0024 Phase A: SIP-level TP launcher MVP
ywkang
2026-04-14 09:00:28 -07:00
-
32536daf2e
Fix ADR-0025: IPCQ direction addressing via address-based matching
ywkang
2026-04-14 00:38:41 -07:00
-
e1084800ab
docs: add ADRs 0024–0031 for SIP-TP launcher stack
ywkang
2026-04-14 00:38:27 -07:00
-
b2c52f0e34
Add English translations for ADR-0018, 0019, 0020, 0021
ywkang
2026-04-13 16:31:32 -07:00
-
10b33b44ba
Add Tensor indexing + hierarchical 3-level all-reduce kernel
ywkang
2026-04-12 23:52:04 -07:00
-
1c8ddc2d03
Fix Phase 1 slot-overwrite race + PE_MATH latency model (n_slots=4 safe)
ywkang
2026-04-12 23:02:19 -07:00
-
74f5f5cf08
Add session-scoped topology fixture in tests/conftest.py
ywkang
2026-04-12 21:13:25 -07:00
-
372c987995
Reduce test time to 12s: shrink GEMM dims + enable pytest-xdist
ywkang
2026-04-12 21:06:41 -07:00
-
bcf941dcee
Speed up regression: 25min → 6min (test matrix + DataExecutor cleanup)
ywkang
2026-04-12 20:52:07 -07:00
-
998cc85762
Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)
ywkang
2026-04-12 19:36:59 -07:00
-
ff2c677a9c
Add 2D grid program_id semantics (ADR-0022)
ywkang
2026-04-09 16:49:56 -07:00
-
dc3fb02aed
Add --verify-data CLI flag, Tensor.data property, parallel DataExecutor
ywkang
2026-04-09 09:34:01 -07:00
-
59e36f0c34
Add E2E pipeline tests: greenlet op_log, GEMM accuracy, latency regression
ywkang
2026-04-09 00:28:03 -07:00
-
81ce55571d
Rename impl names: add builtin. prefix for clear provenance
ywkang
2026-04-09 00:16:24 -07:00
-
1d95df4bee
Restructure legacy backups, remove pe_accel, fix DMA self-routing
ywkang
2026-04-09 00:02:26 -07:00
-
95d583ef9f
Add Phase 1→Phase 2 e2e data tests + GraphEngine enable_data mode
ywkang
2026-04-08 23:49:28 -07:00
-
f5d1606f9d
Add ADR-0021 pipeline tests: self-routing, tiling, overlap
ywkang
2026-04-08 23:40:19 -07:00
-
b6eb97c49a
Implement ADR-0021: PE pipeline refactor with token self-routing
ywkang
2026-04-08 23:35:31 -07:00
-
161132cdcb
ADR-0021: PE pipeline refactor — component separation + token self-routing
ywkang
2026-04-08 23:21:40 -07:00
-
51004c311c
Implement ADR-0020: 2-pass data execution with greenlet kernel runner
ywkang
2026-04-08 00:22:44 -07:00
-
140b85436a
ADR-0020: 2-Pass data execution model with greenlet kernel runner
ywkang
2026-04-07 23:53:49 -07:00
-
eb792e6212
Remove xbar/noc remnants, rule-based cube-view connectors
ywkang
2026-04-06 23:59:12 -07:00
-
7640635f90
M_CPU/SRAM placement via pos_mm in topology.yaml (nearest router)
ywkang
2026-04-05 00:48:20 -07:00
-
3ea4fa90f8
Cube-view: increase 45° stub length and component gap for visibility
ywkang
2026-04-05 00:38:27 -07:00
-
5125d92c17
Cube-view: M_CPU north, 45° stub-straight-stub connector pattern
ywkang
2026-04-05 00:34:48 -07:00
-
72acc5c8bb
Cube-view: UCIe flush against cube edges
ywkang
2026-04-05 00:28:58 -07:00
-
bde76ec959
Cube-view: 45° diagonal from router, then straight to component
ywkang
2026-04-05 00:25:41 -07:00
-
d3de982ea4
Cube-view: 90° router mesh links, 45° component connectors
ywkang
2026-04-05 00:20:28 -07:00
-
df81835d84
Cube-view: UCIe position/size from topology.yaml (ucie_mm.size=2.0)
ywkang
2026-04-05 00:11:11 -07:00
-
66ec6cd40c
Cube-view: UCIe components inside cube boundary with port boxes
ywkang
2026-04-04 23:58:32 -07:00