mukesh ff7d727ddd CCL allreduce: rename to lrab_hierarchical_allreduce + descriptive plots
Rename the intercube all-reduce identity to lrab_hierarchical_allreduce
(module, config key, distributed test) so the name reflects both levels
it implements: LRAB intra-SIP (local reduce to center root + broadcast)
and the hierarchical inter-SIP topology exchange (ring/torus/mesh).
ADR-0032 slug kept as the stable decision id; pure rename, no logic change.

Also in this batch:
- ADR-0032 (EN+KO): document the shipped center-root bidirectional reduce
  (doc was stale corner-root); annotate ccl.yaml root_cube as a placeholder.
- Rename allreduce + pe2pe latency plots to descriptive, title-matching
  filenames and retitle the in-plot headings; drop overview/overview_log.
- Point the PPTX image refs at the new plot names.

Doc + derived-artifact + rename only; no simulation behavior changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 20:50:48 -07:00

51 lines
1.9 KiB
YAML
# ccl.yaml — CCL backend (ahbm) configuration (ADR-0023 D11)
#
# Loaded by AhbmCCLBackend at init_process_group time.
# defaults.algorithm chooses which kernel + topology is installed
# into PE_IPCQ neighbor tables. Host code is unaware of these settings.
defaults:
  # Algorithm to run for this benchmark execution.
  algorithm: lrab_hierarchical_allreduce
  # IPCQ ring buffer location.
  #   tcm:  PE-local TCM (fast, small, conflicts with compute TCM access)
  #   hbm:  PE-local HBM (large, slower DMA latency)
  #   sram: Cube-shared SRAM (medium, cube-internal contention)
  buffer_kind: tcm
  # Backpressure mode.
  #   poll:  spin-loop polling of cached peer pointers
  #   sleep: yield SimPy event, wake on credit return
  backpressure: sleep
  # Ring depth: number of slots per (direction, tx|rx) buffer.
  n_slots: 4
  # Slot size in bytes (must hold one tile worth of data).
  slot_size: 4096
  # PE_DMA virtual channel chunk size (D8).
  vc_chunk_size: 256
  # Credit return fast path message size (D9).
  ipcq_credit_size_bytes: 16
algorithms:
  # ── intercube all-reduce (pe0-only, cube mesh + inter-SIP) ──
  # Reduces across the 4×4 cube mesh within each SIP, then inter-SIP
  # exchange on root cube, then broadcast back. SIP topology is read
  # from topology.yaml → system.sips.topology. Kernel auto-selects
  # ring / torus / mesh inter-SIP exchange pattern.
  lrab_hierarchical_allreduce:
    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
    topology: none
    buffer_kind: tcm
    n_elem: 8
    # root_cube: the kernel currently elects the root dynamically as the
    # geometric center of the cube mesh (root = (h//2)*w + (w//2)) to
    # minimize the intra-SIP critical path, so this value is NOT read today.
    # Kept as a placeholder for a future explicit-root override / runtime
    # election hook (see ADR-0032 D1 + Non-goals).
    root_cube: 15
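
The root_cube comment above gives the center-root formula, root = (h//2)*w + (w//2). A minimal Python sketch of that arithmetic for the 4×4 mesh follows. It is not taken from the kernbench sources: the function names are hypothetical, cube ids are assumed to be row-major, and the hop count is an assumed Manhattan-distance proxy for the intra-SIP critical path on a non-wrapping mesh.

```python
def center_root(h: int, w: int) -> int:
    """Cube id of the geometric center of an h x w mesh.

    Uses the formula quoted in the ccl.yaml comment; ids are assumed row-major.
    """
    return (h // 2) * w + (w // 2)


def critical_path_hops(root: int, h: int, w: int) -> int:
    """Largest Manhattan distance from any cube to root.

    Hypothetical cost proxy: assumes a plain (non-wrapping) mesh with
    one hop per neighbor link.
    """
    r0, c0 = divmod(root, w)
    return max(abs(r - r0) + abs(c - c0) for r in range(h) for c in range(w))


root = center_root(4, 4)
print(root)                             # 10, i.e. row 2, column 2
print(critical_path_hops(root, 4, 4))   # 4 hops from the farthest cube
print(critical_path_hops(15, 4, 4))     # 6 hops if the corner cube 15 were the root
```

Under these assumptions the elected root for a 4×4 mesh is cube 10, not the placeholder value 15, and the center placement shortens the worst-case path from 6 hops to 4.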