kernbench2/docs/diagrams/allreduce_latency_plots/buffer_kind_sweep.csv at a796c1d2f71b9b2b0fc9e840f5ec881be9e8af32

ywkang b8213d43a9 ADR-0019 D1/D4: per-PE HBM CTRL partitioning

Restores per-PE HBM controller partitioning that was lost in
commit 5917b34 ("Replace xbar/bridge/single-NOC with explicit
router mesh"), which had over-consolidated the per-slice HBM CTRL
into a single cube-wide ``hbm_ctrl`` connected to every router —
the opposite of what ADR-0019 D1/D4 specifies.

Builder splits ``hbm_ctrl`` into 8 ``hbm_ctrl.pe{X}`` instances per
cube, each reachable ONLY through PE_X's attaching router via the
existing ``peX.hbm`` attach metadata from cube_mesh.yaml. Cube
aggregate BW now matches the spec (8 PEs × 8 PCs × 32 GB/s =
2048 GB/s) instead of collapsing to 256 GB/s.
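
The bandwidth figures above can be checked with a few lines of arithmetic. This is only a sketch: the constant and function names are hypothetical, "PC" is assumed to mean an HBM pseudo-channel, and the 256 GB/s case is assumed to correspond to a single controller exposing one slice's worth of PCs.

```python
# Hypothetical constants; values are the ones quoted in the commit message.
PES_PER_CUBE = 8    # PEs (and, after this commit, HBM controllers) per cube
PCS_PER_CTRL = 8    # "PCs" per controller (assumed: pseudo-channels)
GBPS_PER_PC = 32    # GB/s per PC


def cube_bandwidth_gbps(n_ctrl: int) -> int:
    """Aggregate HBM bandwidth of a cube with n_ctrl independent controllers."""
    return n_ctrl * PCS_PER_CTRL * GBPS_PER_PC


per_pe_partitioned = cube_bandwidth_gbps(PES_PER_CUBE)  # one hbm_ctrl.pe{X} per PE
consolidated = cube_bandwidth_gbps(1)                   # single cube-wide hbm_ctrl
print(per_pe_partitioned, consolidated)
```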

AddressResolver decodes the target PE from the HBM PA's hbm_offset
(``offset // slice_size``) and returns ``hbm_ctrl.pe{X}``. PathRouter
uses the existing ``_adj_local`` adjacency for same-cube PE_DMA so
the cube's own UCIe port can no longer appear as a zero-distance
shortcut between routers — local PE_DMA now traverses the mesh,
restoring the ADR-0019 D4 worked example
``PE0.pe_dma → r0c0 → … → r1c4 → hbm_ctrl``.
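
The decode step described above (``offset // slice_size``) can be sketched as follows. This is not the repository's AddressResolver: the function name, the bounds check, and the 1 GiB slice size are assumptions made for illustration; only the integer division and the ``hbm_ctrl.pe{X}`` node id format come from the commit message.

```python
def resolve_hbm_ctrl(hbm_offset: int, slice_size: int, n_pes: int = 8) -> str:
    """Map an HBM offset to the node id of the owning per-PE controller.

    Hypothetical helper: each PE is assumed to own one contiguous slice
    of slice_size bytes, laid out in PE order.
    """
    pe = hbm_offset // slice_size
    if not 0 <= pe < n_pes:
        raise ValueError(f"hbm_offset {hbm_offset:#x} is outside {n_pes} slices")
    return f"hbm_ctrl.pe{pe}"


SLICE_SIZE = 1 << 30  # assumed 1 GiB per PE slice, for illustration only

print(resolve_hbm_ctrl(0, SLICE_SIZE))
print(resolve_hbm_ctrl(3 * SLICE_SIZE + 64, SLICE_SIZE))
```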

Tests:
- New tests/test_per_pe_hbm_partition.py: 14 tests covering
  topology shape, per-PE router exclusivity, PA resolution,
  single-hop local path, cross-PE mesh traversal, and end-to-end
  latency monotonicity. Probe CLI now reports
  pe-local < pe-same-half < pe-cross-half (was uniform 141ns).
- Existing tests updated for the new node ids. Two assertions that
  locked in the wrong consolidation are replaced:
  test_noc_mesh.test_hbm_connects_to_all_routers and
  test_topology_compile.test_hbm_ctrl_connects_all_routers are
  now per-PE exclusivity assertions, and
  test_routing.test_all_pe_hbm_equidistant becomes
  test_cross_pe_hbm_distance_increases_with_mesh_hops.
- test_ipcq_buffer_kind_locations.test_hbm_pe_hop_charged_at_large_payload
  threshold recalibrated 4000→1500 ns: the prior figure reflected
  serialization on the over-consolidated single hbm_ctrl; per-PE
  partitioning removes that artificial contention so the gap
  shrinks to the genuine PE↔HBM-hop cost.

Full suite: 645 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-15 01:04:30 -07:00

602 B

buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
hbm,torus_2d,6,128,256,2120.0399999999754
hbm,torus_2d,6,1024,2048,2716.74499999995
hbm,torus_2d,6,8192,16384,7315.185000000081
hbm,torus_2d,6,32768,65536,23081.265000008738
sram,torus_2d,6,128,256,2060.0399999999754
sram,torus_2d,6,1024,2048,2908.74499999995
sram,torus_2d,6,8192,16384,9523.185000000081
sram,torus_2d,6,32768,65536,32201.265000008752
tcm,torus_2d,6,128,256,1964.0399999999754
tcm,torus_2d,6,1024,2048,2476.74499999995
tcm,torus_2d,6,8192,16384,6403.185000000081
tcm,torus_2d,6,32768,65536,19865.265000008738
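
One way to read the sweep is to rank the buffer kinds by latency at each payload size. The sketch below embeds the rows above as a string so it runs on its own; the helper names are hypothetical and are not part of kernbench2.

```python
import csv
import io

# The sweep rows from this file, embedded so the example is self-contained.
SWEEP_CSV = """\
buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
hbm,torus_2d,6,128,256,2120.0399999999754
hbm,torus_2d,6,1024,2048,2716.74499999995
hbm,torus_2d,6,8192,16384,7315.185000000081
hbm,torus_2d,6,32768,65536,23081.265000008738
sram,torus_2d,6,128,256,2060.0399999999754
sram,torus_2d,6,1024,2048,2908.74499999995
sram,torus_2d,6,8192,16384,9523.185000000081
sram,torus_2d,6,32768,65536,32201.265000008752
tcm,torus_2d,6,128,256,1964.0399999999754
tcm,torus_2d,6,1024,2048,2476.74499999995
tcm,torus_2d,6,8192,16384,6403.185000000081
tcm,torus_2d,6,32768,65536,19865.265000008738
"""


def load_latency(text):
    """Return {(buffer_kind, bytes_per_pe): latency_ns} from CSV text."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = (row["buffer_kind"], int(row["bytes_per_pe"]))
        table[key] = float(row["latency_ns"])
    return table


def rank_by_latency(table, bytes_per_pe):
    """Buffer kinds at one payload size, ordered fastest first."""
    kinds = [kind for (kind, size) in table if size == bytes_per_pe]
    return sorted(kinds, key=lambda kind: table[(kind, bytes_per_pe)])


latency = load_latency(SWEEP_CSV)
for size in sorted({size for (_, size) in latency}):
    print(size, rank_by_latency(latency, size))
```

In this sweep tcm has the lowest latency at every size, and sram is faster than hbm only at 256 bytes per PE.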