687c98086d
Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
(dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
retroactive docs pending verification.
Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
deleted; ADR-0019/0021 moved to adr-history with one-line stub status
Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
selection, flit-aware per-flit commit, async finalize, command-only
fallback path)
Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
"Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
(now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)
Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py
Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.
Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
(ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
214 lines
8.2 KiB
Python
214 lines
8.2 KiB
Python
"""Phase 1 micro-tests for IPCQ slot-memory PHYSICAL placement.
|
||
|
||
The current model in ``_BUFFER_KIND_BW`` (src/kernbench/common/ipcq_types.py)
|
||
charges only an intrinsic-memory term for IPCQ slot read/write::
|
||
|
||
TCM: nbytes/512 + 0
|
||
SRAM: nbytes/512 + 2
|
||
HBM: nbytes/256 + 6
|
||
|
||
This treats SRAM and HBM as if they were per-PE local. The topology
|
||
declares the opposite — both live on the cube NoC, behind their own
|
||
router-attached link::
|
||
|
||
topology.yaml:130 sram_to_router_bw_gbs: 128.0
|
||
topology.yaml:129 hbm_to_router_bw_gbs: 256.0
|
||
|
||
So a correct model must charge a PE→bank fabric drain for SRAM and HBM
|
||
on both ``tl.send`` (writer landing bytes into the cube SRAM/HBM bank
|
||
via PE_DMA → router → bank) and ``tl.recv`` (reader pulling bytes back
|
||
across the same link). TCM stays free of that hop because it is
|
||
genuinely per-PE local.
|
||
|
||
The three tests below run the existing torus_2d 6-SIP allreduce harness
|
||
with ``buffer_kind`` flipped between tcm/sram/hbm and assert invariants
|
||
that the post-fix model must satisfy. They EXPECT TO FAIL today because
|
||
the simulator under-charges SRAM and HBM by skipping the PE↔bank hop.
|
||
|
||
Phase 2 will edit:
|
||
- src/kernbench/components/builtin/pe_ipcq.py (_handle_recv: add
|
||
compute_drain_ns(pe→bank, nbytes) for sram/hbm)
|
||
- src/kernbench/components/builtin/pe_dma.py (_handle_ipcq_inbound:
|
||
add second-leg drain for sram/hbm-destined slots)
|
||
|
||
Tests must NEVER be weakened to make Phase 2 pass — invariants below
|
||
follow from physics (link BW × payload), so any model reflecting the
|
||
topology will satisfy them by construction.
|
||
"""
|
||
from __future__ import annotations
|
||
|
||
from pathlib import Path
|
||
|
||
import pytest
|
||
import yaml
|
||
|
||
from kernbench.runtime_api.context import RuntimeContext
|
||
from kernbench.runtime_api.types import DeviceSelector
|
||
from kernbench.sim_engine.engine import GraphEngine
|
||
from kernbench.topology.builder import resolve_topology
|
||
|
||
from tests.test_allreduce_multidevice import (
|
||
_write_temp_configs,
|
||
run_allreduce,
|
||
)
|
||
|
||
|
||
def _run_allreduce_with_buffer_kind(
|
||
tmp_path: Path, *, buffer_kind: str, n_elem: int,
|
||
) -> float:
|
||
"""Run one torus_2d 6-SIP allreduce with the given buffer_kind and
|
||
return critical-path pe_exec_ns (max across all PEs).
|
||
|
||
Mirrors the sweep harness in test_allreduce_buffer_kind_sweep.py
|
||
so the assertions below compare apples-to-apples against that PNG.
|
||
"""
|
||
sub = tmp_path / f"{buffer_kind}_{n_elem}"
|
||
sub.mkdir()
|
||
topo_path, ccl_path = _write_temp_configs(
|
||
sub,
|
||
sip_topology="torus_2d",
|
||
n_sips=6,
|
||
algorithm="intercube_allreduce",
|
||
sip_w=3, sip_h=2,
|
||
n_elem_override=n_elem,
|
||
)
|
||
|
||
with open(ccl_path) as f:
|
||
ccl_cfg = yaml.safe_load(f)
|
||
ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind
|
||
ccl_cfg.setdefault("algorithms", {}).setdefault(
|
||
"intercube_allreduce", {},
|
||
)["buffer_kind"] = buffer_kind
|
||
with open(ccl_path, "w") as f:
|
||
yaml.dump(ccl_cfg, f, default_flow_style=False)
|
||
|
||
topo = resolve_topology(topo_path)
|
||
engine = GraphEngine(topo.topology_obj, enable_data=True)
|
||
spec = topo.topology_obj.spec
|
||
|
||
with RuntimeContext(
|
||
engine=engine,
|
||
target_device=DeviceSelector("all"),
|
||
correlation_id=f"loc_{buffer_kind}_{n_elem}",
|
||
spec=spec,
|
||
) as ctx:
|
||
result = run_allreduce(
|
||
ctx, engine, spec,
|
||
algorithm="intercube_allreduce", ccl_yaml=ccl_path,
|
||
)
|
||
assert result["ok_cubes"] > 0, "allreduce did not validate"
|
||
|
||
pe_exec_vals = [
|
||
float(tr.get("pe_exec_ns", 0.0) or 0.0)
|
||
for _, (_, tr) in engine._results.items()
|
||
if isinstance(tr, dict)
|
||
]
|
||
return max(pe_exec_vals) if pe_exec_vals else 0.0
|
||
|
||
|
||
# ── Phase 1 assertions ───────────────────────────────────────────────
|
||
|
||
|
||
def test_sram_meaningfully_slower_than_tcm_at_large_payload(tmp_path):
|
||
"""At 32 KB / PE the SRAM-backed allreduce must take meaningfully
|
||
longer than the TCM-backed one because every IPCQ slot access goes
|
||
through the 128 GB/s SRAM↔router link, while TCM stays per-PE local.
|
||
|
||
Floor justification (physics, not implementation):
|
||
Per-IPCQ-roundtrip the SRAM tier adds 2 × nbytes/128 ns over TCM
|
||
(one PE→SRAM hop on send-inbound, one SRAM→PE hop on recv).
|
||
At 32 KB: 2 × 32768/128 = 512 ns added per slot exchange.
|
||
With ≥ 10 critical-path exchanges in a 6-SIP torus_2d allreduce
|
||
this is ≥ 5_120 ns. The threshold below is half that to leave
|
||
room for differing critical-path counting.
|
||
|
||
Pre-Phase-2: gap is constant 48 ns (just the SRAM overhead × 24
|
||
slot accesses); test FAILS.
|
||
Post-Phase-2: gap scales with payload; test PASSES.
|
||
"""
|
||
n_elem = 16384 # 32 KB / PE
|
||
lat_tcm = _run_allreduce_with_buffer_kind(
|
||
tmp_path, buffer_kind="tcm", n_elem=n_elem,
|
||
)
|
||
lat_sram = _run_allreduce_with_buffer_kind(
|
||
tmp_path, buffer_kind="sram", n_elem=n_elem,
|
||
)
|
||
delta = lat_sram - lat_tcm
|
||
THRESHOLD_NS = 2_500.0
|
||
assert delta > THRESHOLD_NS, (
|
||
f"SRAM should be ≥ {THRESHOLD_NS:.0f} ns slower than TCM at 32 KB "
|
||
f"because each IPCQ access pays a 128 GB/s PE↔SRAM hop. "
|
||
f"got tcm={lat_tcm:.1f} sram={lat_sram:.1f} delta={delta:.1f} ns"
|
||
)
|
||
|
||
|
||
def test_sram_tcm_gap_scales_with_payload(tmp_path):
|
||
"""The SRAM-vs-TCM gap must grow roughly linearly with payload size.
|
||
|
||
Pre-Phase-2: the only difference between TCM and SRAM is the SRAM
|
||
per-access ``overhead_ns = 2``, which does NOT scale with payload —
|
||
so the gap is the same constant 48 ns at 8 KB and at 32 KB. Ratio = 1.
|
||
|
||
Post-Phase-2: the dominant term is 2 × nbytes/128 (PE↔SRAM hop on
|
||
write+read) which IS linear in payload. Going 8 KB → 32 KB (4×)
|
||
should produce a gap roughly 4× larger.
|
||
|
||
Threshold below is 3× to keep slack for fixed-overhead effects.
|
||
"""
|
||
lat_tcm_small = _run_allreduce_with_buffer_kind(
|
||
tmp_path, buffer_kind="tcm", n_elem=4096, # 8 KB
|
||
)
|
||
lat_sram_small = _run_allreduce_with_buffer_kind(
|
||
tmp_path, buffer_kind="sram", n_elem=4096,
|
||
)
|
||
lat_tcm_large = _run_allreduce_with_buffer_kind(
|
||
tmp_path, buffer_kind="tcm", n_elem=16384, # 32 KB
|
||
)
|
||
lat_sram_large = _run_allreduce_with_buffer_kind(
|
||
tmp_path, buffer_kind="sram", n_elem=16384,
|
||
)
|
||
|
||
gap_small = lat_sram_small - lat_tcm_small
|
||
gap_large = lat_sram_large - lat_tcm_large
|
||
|
||
assert gap_small > 0, (
|
||
f"sanity: SRAM should never be FASTER than TCM, "
|
||
f"got gap_small={gap_small:.1f} ns"
|
||
)
|
||
assert gap_large > 3.0 * gap_small, (
|
||
f"4× payload should produce ≥3× SRAM/TCM gap (linear in nbytes "
|
||
f"because of the 128 GB/s PE↔SRAM hop). "
|
||
f"got gap_small={gap_small:.1f} (8KB), gap_large={gap_large:.1f} "
|
||
f"(32KB), ratio={gap_large / max(gap_small, 1e-9):.2f}"
|
||
)
|
||
|
||
|
||
def test_hbm_pe_hop_charged_at_large_payload(tmp_path):
|
||
"""At 32 KB / PE the HBM-vs-TCM gap must exceed the gap that comes
|
||
purely from HBM's 256 GB/s intrinsic slot-IO disadvantage.
|
||
|
||
Pre-Phase-2 the entire HBM/TCM gap is just the slot-IO term
|
||
(24 × (nbytes/512 + 6) ≈ 1_700 ns at 32 KB). Post-fix adds another
|
||
chunk of latency from the PE↔HBM hop on send and recv, so the
|
||
total HBM/TCM gap should clearly clear the threshold below.
|
||
|
||
Under ADR-0017 D4 per-PE HBM CTRL, each PE's slice runs on its own
|
||
controller with no cross-PE contention, so the IPCQ pattern (each
|
||
PE writes its own slice) yields a gap of ≈ 1.7 µs — well above the
|
||
bare slot-IO term, confirming the PE↔HBM hop is being charged.
|
||
"""
|
||
n_elem = 16384 # 32 KB / PE
|
||
lat_tcm = _run_allreduce_with_buffer_kind(
|
||
tmp_path, buffer_kind="tcm", n_elem=n_elem,
|
||
)
|
||
lat_hbm = _run_allreduce_with_buffer_kind(
|
||
tmp_path, buffer_kind="hbm", n_elem=n_elem,
|
||
)
|
||
delta = lat_hbm - lat_tcm
|
||
THRESHOLD_NS = 1_500.0
|
||
assert delta > THRESHOLD_NS, (
|
||
f"HBM should be ≥ {THRESHOLD_NS:.0f} ns slower than TCM at 32 KB "
|
||
f"once the 256 GB/s PE↔HBM hop is charged on each IPCQ access. "
|
||
f"got tcm={lat_tcm:.1f} hbm={lat_hbm:.1f} delta={delta:.1f} ns"
|
||
)
|