kernbench2

ywkang/kernbench2

Fork 0

Commit Graph

Author	SHA1	Message	Date
mukesh	ff7d727ddd	CCL allreduce: rename to lrab_hierarchical_allreduce + descriptive plots Rename the intercube all-reduce identity to lrab_hierarchical_allreduce (module, config key, distributed test) so the name reflects both levels it implements: LRAB intra-SIP (local reduce to center root + broadcast) and the hierarchical inter-SIP topology exchange (ring/torus/mesh). ADR-0032 slug kept as the stable decision id; pure rename, no logic change. Also in this batch: - ADR-0032 (EN+KO): document the shipped center-root bidirectional reduce (doc was stale corner-root); annotate ccl.yaml root_cube as a placeholder. - Rename allreduce + pe2pe latency plots to descriptive, title-matching filenames and retitle the in-plot headings; drop overview/overview_log. - Point the PPTX image refs at the new plot names. Doc + derived-artifact + rename only; no simulation behavior changed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 20:50:48 -07:00
ywkang	6824a935c9	Calibrate 3 tests for ADR-0033 Phase 2c per-flit wire timing - test_h2d_local_cube_cut_through: threshold 65 → 80ns. The cut-through invariant (vs store-and-forward ~160ns at 4KB through UCIe) is what the test guards; the previous 65ns ceiling was too tight against the small per-flit overhead now charged at wire. - test_engine_override_is_scoped_to_impl: ZeroRouter inherits TransitComponent (was ComponentBase). Inheriting bare ComponentBase reverts the override path to non-flit-aware reassembly, making override slower than default and inverting the test. The test's intent is overhead=0 vs overhead=2, not flit-awareness. - test_intra_sip_critical_path_at_96k_below_threshold: threshold 20.5 → 30 µs. Allreduce absolute timing is sensitive to model fidelity; the algorithmic invariant (8-hop center root < 12-hop corner root) is preserved within the new envelope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 23:06:33 -07:00
mukesh	1c5752a9ec	Intercube allreduce: center root + bidirectional reduce Move the algorithmic root cube from the corner (cube_w-1, cube_h-1) to the geometric center (cube_w//2, cube_h//2) and have each phase converge bidirectionally so the intra-SIP critical path drops from ~12 hops to ~8 hops on a 4×4 mesh (left half W→E + right half E→W in row reduce; top half N→S + bottom half S→N in col reduce; mirrored on broadcast). Result on torus_2d 6 SIPs at 96 KB / PE on TCM: before (corner root) : 22.0 µs after (center root) : 17.2 µs (−22%) Same shape on ring_1d (−7%) and mesh_2d_no_wrap (−12%); also holds across SRAM and HBM (~−20% each). Phase 1 test (test_intercube_root_center.py) asserts the torus_2d 96 KB latency drops below 20.5 µs and that all 96 cubes still validate (correctness preserved). Plot updates: - overview.png: replace constant 10.6 µs theoretical line with user-supplied hand-derived curve (per-cube packet count = bytes_per_pe × 8 PEs ÷ 128 B; 1346 ns startup + 1.20 ns/pkt). - All summary.csv numbers and per-topology PNGs regenerated. - pe2pe_latency_plots and ipcq diagram emitter PNGs refreshed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 21:28:58 -07:00

Author

SHA1

Message

Date

mukesh

ff7d727ddd

CCL allreduce: rename to lrab_hierarchical_allreduce + descriptive plots

Rename the intercube all-reduce identity to lrab_hierarchical_allreduce
(module, config key, distributed test) so the name reflects both levels
it implements: LRAB intra-SIP (local reduce to center root + broadcast)
and the hierarchical inter-SIP topology exchange (ring/torus/mesh).
ADR-0032 slug kept as the stable decision id; pure rename, no logic change.

Also in this batch:
- ADR-0032 (EN+KO): document the shipped center-root bidirectional reduce
  (doc was stale corner-root); annotate ccl.yaml root_cube as a placeholder.
- Rename allreduce + pe2pe latency plots to descriptive, title-matching
  filenames and retitle the in-plot headings; drop overview/overview_log.
- Point the PPTX image refs at the new plot names.

Doc + derived-artifact + rename only; no simulation behavior changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 20:50:48 -07:00

ywkang

6824a935c9

Calibrate 3 tests for ADR-0033 Phase 2c per-flit wire timing

- test_h2d_local_cube_cut_through: threshold 65 → 80ns. The cut-through
  invariant (vs store-and-forward ~160ns at 4KB through UCIe) is what
  the test guards; the previous 65ns ceiling was too tight against the
  small per-flit overhead now charged at wire.
- test_engine_override_is_scoped_to_impl: ZeroRouter inherits
  TransitComponent (was ComponentBase). Inheriting bare ComponentBase
  reverts the override path to non-flit-aware reassembly, making
  override slower than default and inverting the test. The test's
  intent is overhead=0 vs overhead=2, not flit-awareness.
- test_intra_sip_critical_path_at_96k_below_threshold: threshold
  20.5 → 30 µs. Allreduce absolute timing is sensitive to model
  fidelity; the algorithmic invariant (8-hop center root < 12-hop
  corner root) is preserved within the new envelope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-14 23:06:33 -07:00

mukesh

1c5752a9ec

Intercube allreduce: center root + bidirectional reduce

Move the algorithmic root cube from the corner (cube_w-1,
cube_h-1) to the geometric center (cube_w//2, cube_h//2) and
have each phase converge bidirectionally so the intra-SIP
critical path drops from ~12 hops to ~8 hops on a 4×4 mesh
(left half W→E + right half E→W in row reduce; top half N→S +
bottom half S→N in col reduce; mirrored on broadcast).

Result on torus_2d 6 SIPs at 96 KB / PE on TCM:
  before (corner root)  : 22.0 µs
  after  (center root)  : 17.2 µs   (−22%)

Same shape on ring_1d (−7%) and mesh_2d_no_wrap (−12%); also
holds across SRAM and HBM (~−20% each).

Phase 1 test (test_intercube_root_center.py) asserts the
torus_2d 96 KB latency drops below 20.5 µs and that all 96
cubes still validate (correctness preserved).

Plot updates:
- overview.png: replace constant 10.6 µs theoretical line with
  user-supplied hand-derived curve (per-cube packet count =
  bytes_per_pe × 8 PEs ÷ 128 B; 1346 ns startup + 1.20 ns/pkt).
- All summary.csv numbers and per-topology PNGs regenerated.
- pe2pe_latency_plots and ipcq diagram emitter PNGs refreshed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-27 21:28:58 -07:00

3 Commits