CCL allreduce: rename to lrab_hierarchical_allreduce + descriptive plots
Rename the intercube all-reduce identity to lrab_hierarchical_allreduce
(module, config key, distributed test) so the name reflects both levels
it implements: LRAB intra-SIP (local reduce to center root + broadcast)
and the hierarchical inter-SIP topology exchange (ring/torus/mesh).
ADR-0032 slug kept as the stable decision id; pure rename, no logic
change.

Also in this batch:
- ADR-0032 (EN+KO): document the shipped center-root bidirectional
  reduce (doc was stale corner-root); annotate ccl.yaml root_cube as a
  placeholder.
- Rename allreduce + pe2pe latency plots to descriptive, title-matching
  filenames and retitle the in-plot headings; drop overview/overview_log.
- Point the PPTX image refs at the new plot names.

Doc + derived-artifact + rename only; no simulation behavior changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
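
For readers unfamiliar with the two levels the new name refers to, the following is a minimal sketch only, not the module's implementation. It assumes scalar per-PE values, a plain ring between SIP roots, and a hypothetical function name and data layout; the center-root bidirectional reduce and the torus/mesh exchanges mentioned above are not modelled.

```python
from typing import List


def lrab_hierarchical_allreduce_sketch(sips: List[List[float]]) -> List[List[float]]:
    """Two-level sum all-reduce. sips[s][p] is the value on PE p of SIP s
    (hypothetical layout, chosen for illustration)."""
    # Level 1, intra-SIP: reduce every PE's value onto one root per SIP.
    roots = [sum(pe_values) for pe_values in sips]

    # Level 2, inter-SIP: forward partials around a ring of SIP roots.
    n = len(roots)
    acc = list(roots)  # running total held by each root
    msg = list(roots)  # value each root forwards on the next step
    for _ in range(n - 1):
        # Each root receives what its left neighbour forwarded.
        msg = [msg[(i - 1) % n] for i in range(n)]
        acc = [a + m for a, m in zip(acc, msg)]

    # Level 1 again: each root broadcasts the global sum to its own PEs.
    return [[acc[s]] * len(sips[s]) for s in range(n)]


if __name__ == "__main__":
    # Three SIPs with two PEs each; every PE ends up holding 1+2+3+4+5+6.
    print(lrab_hierarchical_allreduce_sketch([[1, 2], [3, 4], [5, 6]]))
```

After n - 1 ring steps every root has seen each other root's partial exactly once, so all PEs finish with the same global sum.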
@@ -4,8 +4,8 @@ Slides:
 1. Overall architecture — how PEs are connected (cube_mesh_view)
 2. Model correctness — DMA vs P2P latency (pe2pe overview)
 3. PE-to-PE IPCQ communication (ipcq_two_pe_dma)
-4. 6-device allreduce — model vs theoretical vs ext-sim (overview_broken)
-5. IPCQ buffer-kind sweep — TCM vs SRAM vs HBM (buffer_kind_sweep)
+4. 6-device allreduce — model vs theoretical vs FSIM (comparison_…_fsim)
+5. IPCQ buffer-kind sweep — TCM vs SRAM vs HBM (…_with_TCM_SRAM_HBM)
 6. PE_accelerator data path (composite GEMM pipeline structure)
 7. matmul(32, 128, 32) — composite GEMM execution sequence
 8. matmul(32, 128, 128) — pipeline scaling and HBM contention
@@ -63,7 +63,7 @@ SLIDES = [
     },
     {
         "title": "4. 6-Device Allreduce: Model vs Theoretical vs External Simulator",
-        "image": DIAG / "allreduce_latency_plots" / "overview_broken.png",
+        "image": DIAG / "allreduce_latency_plots" / "comparison_mesh_vs_ring_vs_2DTorus_vs_theoretical_vs_fsim.png",
         "bullets": [
             "Three SIP topologies (ring / torus / mesh) swept 16 B → 96 KB per PE",
             "Dashed red curve: hand-derived theoretical model for torus_2d (6 SIPs)",
@@ -73,7 +73,7 @@ SLIDES = [
     },
     {
         "title": "5. IPCQ Slot Memory: TCM vs SRAM vs HBM",
-        "image": DIAG / "allreduce_latency_plots" / "buffer_kind_sweep.png",
+        "image": DIAG / "allreduce_latency_plots" / "AllReduce_LRAB_2Dtorus_6SiP_2x3_with_TCM_SRAM_HBM.png",
         "bullets": [
             "Same allreduce with slot memory swapped: TCM (per-PE local) / SRAM / HBM (cube-shared, behind router link)",
             "Cost = NoC drain + slot-IO + PE↔bank hop; only TCM skips the bank hop",
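
Because this change repoints the slide `"image"` entries at renamed plot files, a quick existence check can catch a stale path before the PPTX is built. The helper below is a hypothetical sketch and not part of this commit; it only assumes that each slide is a dict with an optional `"image"` path, as in the `SLIDES` entries shown in the diff.

```python
from pathlib import Path
from typing import Dict, List


def missing_slide_images(slides: List[Dict]) -> List[Path]:
    """Return every "image" path in `slides` that is not a file on disk."""
    missing = []
    for slide in slides:
        image = slide.get("image")  # slides without an image are skipped
        if image is not None and not Path(image).is_file():
            missing.append(Path(image))
    return missing
```

Calling `missing_slide_images(SLIDES)` in the deck script and failing on a non-empty result would flag any reference that still points at a dropped plot name such as `overview_broken.png`.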