5 Commits

Author SHA1 Message Date
mukesh 54fcb7e4bc Add tests/test_emit_ipcq_diagram.py (missed from earlier commit)
This is the diagram generator that emits ipcq_send_recv.png and
ipcq_two_pe_dma.png (referenced by commit 1e39214 but accidentally
left untracked).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 21:42:44 -07:00
mukesh ad5f01ab13 Merge origin/master: combine single-cube fast path + center-root reduce
Conflict resolution:
- intercube_allreduce.py: kept origin's `if single_cube:` early-exit
  (TP launches kernel on one cube/rank → skip intra-SIP mesh and go
  direct to inter-SIP exchange) AND replaced the multi-cube body with
  the local center-root + bidirectional reduce/broadcast (8-hop
  critical path on 4×4 vs 12 with corner root).
- tests/{allreduce,pe2pe}_latency_plots/: kept the local move to
  docs/diagrams/; dropped origin's stale content edits to the old
  paths (regenerable derived artifacts).
- docs/diagrams/pe2pe_latency_plots/summary.csv: kept local
  (post-Phase-2 + center-root values).

Origin contributions retained as-is:
- pyproject.toml: matplotlib >= 3.7 dep.
- runtime_api/distributed.py: derive effective cube_w/h from tensor
  shard placement so single-cube TP paths get cube_w=cube_h=1.
- kernel_args() now accepts optional cube_w/cube_h kwargs.

Verified post-merge:
- test_intercube_root_center.py: 2/2 (center-root multi-cube path).
- test_tp_layers.py + test_tp_mlp.py: 10/10 (single-cube TP path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 21:41:46 -07:00
mukesh 1c5752a9ec Intercube allreduce: center root + bidirectional reduce
Move the algorithmic root cube from the corner (cube_w-1,
cube_h-1) to the geometric center (cube_w//2, cube_h//2) and
have each phase converge bidirectionally so the intra-SIP
critical path drops from ~12 hops to ~8 hops on a 4×4 mesh
(left half W→E + right half E→W in row reduce; top half N→S +
bottom half S→N in col reduce; mirrored on broadcast).

Result on torus_2d 6 SIPs at 96 KB / PE on TCM:
  before (corner root)  : 22.0 µs
  after  (center root)  : 17.2 µs   (−22%)

Same shape on ring_1d (−7%) and mesh_2d_no_wrap (−12%); also
holds across SRAM and HBM (~−20% each).

Phase 1 test (test_intercube_root_center.py) asserts the
torus_2d 96 KB latency drops below 20.5 µs and that all 96
cubes still validate (correctness preserved).

Plot updates:
- overview.png: replace constant 10.6 µs theoretical line with
  user-supplied hand-derived curve (per-cube packet count =
  bytes_per_pe × 8 PEs ÷ 128 B; 1346 ns startup + 1.20 ns/pkt).
- All summary.csv numbers and per-topology PNGs regenerated.
- pe2pe_latency_plots and ipcq diagram emitter PNGs refreshed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 21:28:58 -07:00
mukesh 84a1325e5c ADR-0023 D9.7: IPCQ slot-memory latency model (TCM/SRAM/HBM)
Charge per-tier bandwidth + setup overhead at IPCQ slot WRITE
(receiver inbound DMA, in pe_dma._handle_ipcq_inbound) and slot
READ (recv consume, in pe_ipcq._handle_recv). Tier table
(common/ipcq_types.py):
  tcm  : 512 GB/s, 0 ns
  sram : 128 GB/s, 2 ns
  hbm  :  32 GB/s, 6 ns

Before this change, slot read/write was free regardless of
buffer_kind, making memory-tier choice invisible in simulated
latency. After the change, swapping buffer_kind in ccl.yaml
produces measurable per-tier separation in allreduce latency.

Tests:
  test_ipcq_buffer_kind_latency.py — three micro-tests asserting
    tcm < sram < hbm ordering, payload-scaling, and that
    buffer_kind sensitivity grows with payload (credit-only path
    stays fabric-bound).
  test_allreduce_buffer_kind_sweep.py — 12-config parametrized
    sweep emitting buffer_kind_sweep.png (3 lines, torus_2d).

conftest sessionfinish hook generalised to dispatch multiple
sweep aggregators (allreduce + buffer-kind).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 21:28:34 -07:00
mukesh 1e39214f89 Move generated diagrams to docs/diagrams/; add IPCQ diagram emitter
Plot output dirs now live under docs/diagrams/ (the canonical
"derived artifacts" location per CLAUDE.md):
  tests/allreduce_latency_plots/ → docs/diagrams/allreduce_latency_plots/
  tests/pe2pe_latency_plots/     → docs/diagrams/pe2pe_latency_plots/
  + new docs/diagrams/ipcq_diagram_plots/ with two presentation diagrams
    (ipcq_send_recv.png, ipcq_two_pe_dma.png)

New test tests/test_emit_ipcq_diagram.py renders the two IPCQ
diagrams from a static description (no simulation); it exists so
the diagrams can be regenerated reproducibly.

Path references updated in tests/test_pe_to_pe_latency.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 21:28:17 -07:00
36 changed files with 1480 additions and 218 deletions
@@ -0,0 +1,13 @@
buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
hbm,torus_2d,6,128,256,2002.0399999999827
hbm,torus_2d,6,1024,2048,3541.0399999999827
hbm,torus_2d,6,8192,16384,15889.03999999999
hbm,torus_2d,6,32768,65536,58225.03999999998
sram,torus_2d,6,128,256,1762.0399999999827
sram,torus_2d,6,1024,2048,2293.0399999999827
sram,torus_2d,6,8192,16384,6577.039999999986
sram,torus_2d,6,32768,65536,21265.03999999992
tcm,torus_2d,6,128,256,1678.0399999999827
tcm,torus_2d,6,1024,2048,1957.0399999999827
tcm,torus_2d,6,8192,16384,4225.039999999986
tcm,torus_2d,6,32768,65536,12001.03999999992
1 buffer_kind sip_topology n_sips n_elem bytes_per_pe latency_ns
2 hbm torus_2d 6 128 256 2002.0399999999827
3 hbm torus_2d 6 1024 2048 3541.0399999999827
4 hbm torus_2d 6 8192 16384 15889.03999999999
5 hbm torus_2d 6 32768 65536 58225.03999999998
6 sram torus_2d 6 128 256 1762.0399999999827
7 sram torus_2d 6 1024 2048 2293.0399999999827
8 sram torus_2d 6 8192 16384 6577.039999999986
9 sram torus_2d 6 32768 65536 21265.03999999992
10 tcm torus_2d 6 128 256 1678.0399999999827
11 tcm torus_2d 6 1024 2048 1957.0399999999827
12 tcm torus_2d 6 8192 16384 4225.039999999986
13 tcm torus_2d 6 32768 65536 12001.03999999992
Binary file not shown.

After

Width:  |  Height:  |  Size: 68 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 40 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 82 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 38 KiB

@@ -0,0 +1,37 @@
algorithm,sip_topology,n_sips,n_elem,bytes_per_pe,bytes_per_sip,latency_ns
intercube_allreduce,mesh_2d_no_wrap,6,8,16,256,2626.302499999998
intercube_allreduce,mesh_2d_no_wrap,6,32,64,1024,2634.7399999999952
intercube_allreduce,mesh_2d_no_wrap,6,64,128,2048,2645.9899999999925
intercube_allreduce,mesh_2d_no_wrap,6,128,256,4096,2668.489999999987
intercube_allreduce,mesh_2d_no_wrap,6,512,1024,16384,2812.489999999987
intercube_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3010.489999999987
intercube_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,3406.489999999987
intercube_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,4198.489999999965
intercube_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,5782.489999999969
intercube_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,8950.489999999925
intercube_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,15286.48999999986
intercube_allreduce,mesh_2d_no_wrap,6,49152,98304,1572864,21622.489999999932
intercube_allreduce,ring_1d,6,8,16,256,2302.9849999999933
intercube_allreduce,ring_1d,6,32,64,1024,2310.8599999999906
intercube_allreduce,ring_1d,6,64,128,2048,2321.359999999988
intercube_allreduce,ring_1d,6,128,256,4096,2342.3599999999824
intercube_allreduce,ring_1d,6,512,1024,16384,2479.3599999999824
intercube_allreduce,ring_1d,6,1024,2048,32768,2669.3599999999824
intercube_allreduce,ring_1d,6,2048,4096,65536,3049.3599999999824
intercube_allreduce,ring_1d,6,4096,8192,131072,3809.3599999999715
intercube_allreduce,ring_1d,6,8192,16384,262144,5329.359999999979
intercube_allreduce,ring_1d,6,16384,32768,524288,8369.35999999992
intercube_allreduce,ring_1d,6,32768,65536,1048576,14449.359999999899
intercube_allreduce,ring_1d,6,49152,98304,1572864,20529.35999999997
intercube_allreduce,torus_2d,6,8,16,256,1644.2899999999936
intercube_allreduce,torus_2d,6,32,64,1024,1651.0399999999909
intercube_allreduce,torus_2d,6,64,128,2048,1660.0399999999881
intercube_allreduce,torus_2d,6,128,256,4096,1678.0399999999827
intercube_allreduce,torus_2d,6,512,1024,16384,1795.0399999999827
intercube_allreduce,torus_2d,6,1024,2048,32768,1957.0399999999827
intercube_allreduce,torus_2d,6,2048,4096,65536,2281.0399999999827
intercube_allreduce,torus_2d,6,4096,8192,131072,2929.039999999979
intercube_allreduce,torus_2d,6,8192,16384,262144,4225.039999999986
intercube_allreduce,torus_2d,6,16384,32768,524288,6817.039999999943
intercube_allreduce,torus_2d,6,32768,65536,1048576,12001.03999999992
intercube_allreduce,torus_2d,6,49152,98304,1572864,17185.039999999994
1 algorithm sip_topology n_sips n_elem bytes_per_pe bytes_per_sip latency_ns
2 intercube_allreduce mesh_2d_no_wrap 6 8 16 256 2626.302499999998
3 intercube_allreduce mesh_2d_no_wrap 6 32 64 1024 2634.7399999999952
4 intercube_allreduce mesh_2d_no_wrap 6 64 128 2048 2645.9899999999925
5 intercube_allreduce mesh_2d_no_wrap 6 128 256 4096 2668.489999999987
6 intercube_allreduce mesh_2d_no_wrap 6 512 1024 16384 2812.489999999987
7 intercube_allreduce mesh_2d_no_wrap 6 1024 2048 32768 3010.489999999987
8 intercube_allreduce mesh_2d_no_wrap 6 2048 4096 65536 3406.489999999987
9 intercube_allreduce mesh_2d_no_wrap 6 4096 8192 131072 4198.489999999965
10 intercube_allreduce mesh_2d_no_wrap 6 8192 16384 262144 5782.489999999969
11 intercube_allreduce mesh_2d_no_wrap 6 16384 32768 524288 8950.489999999925
12 intercube_allreduce mesh_2d_no_wrap 6 32768 65536 1048576 15286.48999999986
13 intercube_allreduce mesh_2d_no_wrap 6 49152 98304 1572864 21622.489999999932
14 intercube_allreduce ring_1d 6 8 16 256 2302.9849999999933
15 intercube_allreduce ring_1d 6 32 64 1024 2310.8599999999906
16 intercube_allreduce ring_1d 6 64 128 2048 2321.359999999988
17 intercube_allreduce ring_1d 6 128 256 4096 2342.3599999999824
18 intercube_allreduce ring_1d 6 512 1024 16384 2479.3599999999824
19 intercube_allreduce ring_1d 6 1024 2048 32768 2669.3599999999824
20 intercube_allreduce ring_1d 6 2048 4096 65536 3049.3599999999824
21 intercube_allreduce ring_1d 6 4096 8192 131072 3809.3599999999715
22 intercube_allreduce ring_1d 6 8192 16384 262144 5329.359999999979
23 intercube_allreduce ring_1d 6 16384 32768 524288 8369.35999999992
24 intercube_allreduce ring_1d 6 32768 65536 1048576 14449.359999999899
25 intercube_allreduce ring_1d 6 49152 98304 1572864 20529.35999999997
26 intercube_allreduce torus_2d 6 8 16 256 1644.2899999999936
27 intercube_allreduce torus_2d 6 32 64 1024 1651.0399999999909
28 intercube_allreduce torus_2d 6 64 128 2048 1660.0399999999881
29 intercube_allreduce torus_2d 6 128 256 4096 1678.0399999999827
30 intercube_allreduce torus_2d 6 512 1024 16384 1795.0399999999827
31 intercube_allreduce torus_2d 6 1024 2048 32768 1957.0399999999827
32 intercube_allreduce torus_2d 6 2048 4096 65536 2281.0399999999827
33 intercube_allreduce torus_2d 6 4096 8192 131072 2929.039999999979
34 intercube_allreduce torus_2d 6 8192 16384 262144 4225.039999999986
35 intercube_allreduce torus_2d 6 16384 32768 524288 6817.039999999943
36 intercube_allreduce torus_2d 6 32768 65536 1048576 12001.03999999992
37 intercube_allreduce torus_2d 6 49152 98304 1572864 17185.039999999994

Before

Width:  |  Height:  |  Size: 194 KiB

After

Width:  |  Height:  |  Size: 194 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 38 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 233 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 166 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 45 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 45 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 53 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 52 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 103 KiB

@@ -1,81 +1,81 @@
hop,label,size_bytes,path,total_ns
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,ipcq,31.1399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,raw,12.019999999996799
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,ipcq,32.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,raw,13.019999999996799
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,ipcq,34.1399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,raw,14.019999999996799
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,ipcq,35.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,raw,15.019999999996799
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,ipcq,38.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,raw,17.0199999999968
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,ipcq,41.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,raw,19.0199999999968
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,ipcq,53.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,raw,27.0199999999968
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,ipcq,77.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,raw,43.0199999999968
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,ipcq,125.64000000000306
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,raw,75.02000000000407
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,ipcq,149.64000000000306
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,raw,91.02000000000407
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,ipcq,31.1399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,raw,12.019999999996799
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,ipcq,32.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,raw,13.019999999996799
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,ipcq,34.1399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,raw,14.019999999996799
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,ipcq,35.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,raw,15.019999999996799
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,ipcq,38.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,raw,17.0199999999968
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,ipcq,41.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,raw,19.0199999999968
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,ipcq,53.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,raw,27.0199999999968
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,ipcq,77.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,raw,43.0199999999968
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,ipcq,125.64000000000306
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,raw,75.02000000000407
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,ipcq,149.64000000000306
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,raw,91.02000000000407
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,ipcq,67.15999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,raw,68.53999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,ipcq,68.65999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,raw,70.03999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,ipcq,70.15999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,raw,71.53999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,ipcq,71.65999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,raw,73.03999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,ipcq,74.65999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,raw,76.03999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,ipcq,77.65999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,raw,79.03999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,ipcq,89.65999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,raw,91.03999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,ipcq,113.65999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,raw,115.03999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,ipcq,161.65999999999985
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,raw,163.04000000000087
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,ipcq,185.65999999999985
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,raw,187.04000000000087
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,ipcq,87.15999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,raw,88.53999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,ipcq,88.65999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,raw,90.03999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,ipcq,90.15999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,raw,91.53999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,ipcq,91.65999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,raw,93.03999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,ipcq,94.65999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,raw,96.03999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,ipcq,97.65999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,raw,99.03999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,ipcq,109.65999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,raw,111.03999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,ipcq,133.65999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,raw,135.03999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,ipcq,181.65999999999985
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,raw,183.04000000000087
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,ipcq,205.65999999999985
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,raw,207.04000000000087
hop,label,size_bytes,path,total_ns
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,ipcq,31.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,raw,12.019999999996799
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,ipcq,33.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,raw,13.019999999996799
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,ipcq,35.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,raw,14.019999999996799
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,ipcq,37.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,raw,15.019999999996799
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,ipcq,41.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,raw,17.0199999999968
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,ipcq,45.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,raw,19.0199999999968
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,ipcq,61.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,raw,27.0199999999968
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,ipcq,93.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,raw,43.0199999999968
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,ipcq,157.64000000000306
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,raw,75.02000000000407
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,ipcq,189.64000000000306
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,raw,91.02000000000407
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,ipcq,31.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,raw,12.019999999996799
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,ipcq,33.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,raw,13.019999999996799
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,ipcq,35.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,raw,14.019999999996799
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,ipcq,37.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,raw,15.019999999996799
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,ipcq,41.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,raw,17.0199999999968
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,ipcq,45.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,raw,19.0199999999968
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,ipcq,61.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,raw,27.0199999999968
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,ipcq,93.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,raw,43.0199999999968
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,ipcq,157.64000000000306
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,raw,75.02000000000407
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,ipcq,189.64000000000306
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,raw,91.02000000000407
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,ipcq,67.65999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,raw,68.53999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,ipcq,69.65999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,raw,70.03999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,ipcq,71.65999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,raw,71.53999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,ipcq,73.65999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,raw,73.03999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,ipcq,77.65999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,raw,76.03999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,ipcq,81.65999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,raw,79.03999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,ipcq,97.65999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,raw,91.03999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,ipcq,129.65999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,raw,115.03999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,ipcq,193.65999999999985
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,raw,163.04000000000087
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,ipcq,225.65999999999985
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,raw,187.04000000000087
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,ipcq,87.65999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,raw,88.53999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,ipcq,89.65999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,raw,90.03999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,ipcq,91.65999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,raw,91.53999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,ipcq,93.65999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,raw,93.03999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,ipcq,97.65999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,raw,96.03999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,ipcq,101.65999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,raw,99.03999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,ipcq,117.65999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,raw,111.03999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,ipcq,149.65999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,raw,135.03999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,ipcq,213.65999999999985
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,raw,183.04000000000087
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,ipcq,245.65999999999985
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,raw,207.04000000000087
1 hop label size_bytes path total_ns
2 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 128 ipcq 31.1399999999976 31.6399999999976
3 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 128 raw 12.019999999996799
4 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 256 ipcq 32.6399999999976 33.6399999999976
5 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 256 raw 13.019999999996799
6 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 384 ipcq 34.1399999999976 35.6399999999976
7 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 384 raw 14.019999999996799
8 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 512 ipcq 35.6399999999976 37.6399999999976
9 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 512 raw 15.019999999996799
10 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 768 ipcq 38.6399999999976 41.6399999999976
11 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 768 raw 17.0199999999968
12 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 1024 ipcq 41.6399999999976 45.6399999999976
13 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 1024 raw 19.0199999999968
14 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 2048 ipcq 53.6399999999976 61.6399999999976
15 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 2048 raw 27.0199999999968
16 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 4096 ipcq 77.6399999999976 93.6399999999976
17 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 4096 raw 43.0199999999968
18 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 8192 ipcq 125.64000000000306 157.64000000000306
19 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 8192 raw 75.02000000000407
20 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 10240 ipcq 149.64000000000306 189.64000000000306
21 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 10240 raw 91.02000000000407
22 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 128 ipcq 31.1399999999976 31.6399999999976
23 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 128 raw 12.019999999996799
24 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 256 ipcq 32.6399999999976 33.6399999999976
25 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 256 raw 13.019999999996799
26 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 384 ipcq 34.1399999999976 35.6399999999976
27 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 384 raw 14.019999999996799
28 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 512 ipcq 35.6399999999976 37.6399999999976
29 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 512 raw 15.019999999996799
30 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 768 ipcq 38.6399999999976 41.6399999999976
31 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 768 raw 17.0199999999968
32 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 1024 ipcq 41.6399999999976 45.6399999999976
33 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 1024 raw 19.0199999999968
34 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 2048 ipcq 53.6399999999976 61.6399999999976
35 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 2048 raw 27.0199999999968
36 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 4096 ipcq 77.6399999999976 93.6399999999976
37 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 4096 raw 43.0199999999968
38 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 8192 ipcq 125.64000000000306 157.64000000000306
39 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 8192 raw 75.02000000000407
40 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 10240 ipcq 149.64000000000306 189.64000000000306
41 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 10240 raw 91.02000000000407
42 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 128 ipcq 67.15999999999804 67.65999999999804
43 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 128 raw 68.53999999999724
44 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 256 ipcq 68.65999999999804 69.65999999999804
45 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 256 raw 70.03999999999724
46 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 384 ipcq 70.15999999999804 71.65999999999804
47 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 384 raw 71.53999999999724
48 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 512 ipcq 71.65999999999804 73.65999999999804
49 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 512 raw 73.03999999999724
50 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 768 ipcq 74.65999999999804 77.65999999999804
51 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 768 raw 76.03999999999724
52 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 1024 ipcq 77.65999999999804 81.65999999999804
53 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 1024 raw 79.03999999999724
54 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 2048 ipcq 89.65999999999804 97.65999999999804
55 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 2048 raw 91.03999999999724
56 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 4096 ipcq 113.65999999999804 129.65999999999804
57 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 4096 raw 115.03999999999724
58 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 8192 ipcq 161.65999999999985 193.65999999999985
59 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 8192 raw 163.04000000000087
60 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 10240 ipcq 185.65999999999985 225.65999999999985
61 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 10240 raw 187.04000000000087
62 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 128 ipcq 87.15999999999804 87.65999999999804
63 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 128 raw 88.53999999999724
64 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 256 ipcq 88.65999999999804 89.65999999999804
65 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 256 raw 90.03999999999724
66 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 384 ipcq 90.15999999999804 91.65999999999804
67 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 384 raw 91.53999999999724
68 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 512 ipcq 91.65999999999804 93.65999999999804
69 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 512 raw 93.03999999999724
70 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 768 ipcq 94.65999999999804 97.65999999999804
71 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 768 raw 96.03999999999724
72 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 1024 ipcq 97.65999999999804 101.65999999999804
73 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 1024 raw 99.03999999999724
74 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 2048 ipcq 109.65999999999804 117.65999999999804
75 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 2048 raw 111.03999999999724
76 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 4096 ipcq 133.65999999999804 149.65999999999804
77 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 4096 raw 135.03999999999724
78 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 8192 ipcq 181.65999999999985 213.65999999999985
79 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 8192 raw 183.04000000000087
80 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 10240 ipcq 205.65999999999985 245.65999999999985
81 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 10240 raw 207.04000000000087
@@ -109,6 +109,11 @@ def allreduce_intercube_multidevice(
):
"""Intercube all-reduce (pe0-only) with configurable SIP topology.
Root cube sits at the geometric center (cube_w//2, cube_h//2) and
each phase converges bidirectionally so the intra-SIP critical path
is ~half what a corner-root walk would be (e.g., 4×4 mesh: 4 hops
reduce + 4 hops broadcast vs 6+6 with corner root).
Args:
t_ptr: VA base of the row-wise-sharded tensor on this SIP.
n_elem: f16 elements per cube tile.
@@ -127,6 +132,10 @@ def allreduce_intercube_multidevice(
nbytes = n_elem * 2
single_cube = (cube_w == 1 and cube_h == 1)
root_col = cube_w // 2
root_row = cube_h // 2
root_cube = root_row * cube_w + root_col
pe_addr = t_ptr + cube_id * nbytes
acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
@@ -143,33 +152,55 @@ def allreduce_intercube_multidevice(
acc = _inter_sip_mesh_2d(
acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
else:
# ── Multi-cube mode: full mesh reduce + inter-SIP + broadcast ──
# ── Multi-cube mode: center-root bidirectional reduce
# + inter-SIP exchange + bidirectional broadcast ──
# Phase 1: row reduce W → E
if col == 0:
# Phase 1: row reduce — converge at col == root_col.
# Left half (col < root_col) walks W→E; right half (col > root_col)
# walks E→W; the root_col cube merges both sides.
if col == 0 and root_col > 0:
tl.send(dir="E", src=acc)
elif col < cube_w - 1:
elif 0 < col < root_col:
recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
acc = acc + recv
tl.send(dir="E", src=acc)
else:
recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
elif col == root_col:
if root_col > 0:
recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
acc = acc + recv
if cube_w - 1 > root_col:
recv = tl.recv(dir="E", shape=(n_elem,), dtype="f16")
acc = acc + recv
elif root_col < col < cube_w - 1:
recv = tl.recv(dir="E", shape=(n_elem,), dtype="f16")
acc = acc + recv
tl.send(dir="W", src=acc)
elif col == cube_w - 1 and cube_w - 1 > root_col:
tl.send(dir="W", src=acc)
# Phase 2: col reduce N → S on rightmost column
if col == cube_w - 1:
if row == 0:
# Phase 2: col reduce on col == root_col — converge at row == root_row.
if col == root_col:
if row == 0 and root_row > 0:
tl.send(dir="S", src=acc)
elif row < cube_h - 1:
elif 0 < row < root_row:
recv = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
acc = acc + recv
tl.send(dir="S", src=acc)
else:
recv = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
elif row == root_row:
if root_row > 0:
recv = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
acc = acc + recv
if cube_h - 1 > root_row:
recv = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
acc = acc + recv
elif root_row < row < cube_h - 1:
recv = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
acc = acc + recv
tl.send(dir="N", src=acc)
elif row == cube_h - 1 and cube_h - 1 > root_row:
tl.send(dir="N", src=acc)
# Phase 3: inter-SIP exchange on root cube
root_cube = (cube_h - 1) * cube_w + (cube_w - 1)
# Phase 3: inter-SIP exchange on root cube.
if cube_id == root_cube and n_sips > 1:
if sip_topo_kind == SIP_TOPO_RING:
acc = _inter_sip_ring(acc, n_sips, n_elem, tl)
@@ -180,24 +211,36 @@ def allreduce_intercube_multidevice(
acc = _inter_sip_mesh_2d(
acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
# Phase 4: col broadcast S → N on rightmost column
if col == cube_w - 1:
if row == cube_h - 1:
tl.send(dir="N", src=acc)
elif row > 0:
acc = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
tl.send(dir="N", src=acc)
else:
# Phase 4: col broadcast on col == root_col, outward from root_row.
if col == root_col:
if row == root_row:
if root_row > 0:
tl.send(dir="N", src=acc)
if cube_h - 1 > root_row:
tl.send(dir="S", src=acc)
elif row < root_row:
acc = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
if row > 0:
tl.send(dir="N", src=acc)
elif row > root_row:
acc = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
if row < cube_h - 1:
tl.send(dir="S", src=acc)
# Phase 5: row broadcast E → W
if col == cube_w - 1:
tl.send(dir="W", src=acc)
elif col > 0:
acc = tl.recv(dir="E", shape=(n_elem,), dtype="f16")
tl.send(dir="W", src=acc)
else:
# Phase 5: row broadcast outward from root_col.
if col == root_col:
if root_col > 0:
tl.send(dir="W", src=acc)
if cube_w - 1 > root_col:
tl.send(dir="E", src=acc)
elif col < root_col:
acc = tl.recv(dir="E", shape=(n_elem,), dtype="f16")
if col > 0:
tl.send(dir="W", src=acc)
elif col > root_col:
acc = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
if col < cube_w - 1:
tl.send(dir="E", src=acc)
tl.store(pe_addr, acc)
+20
View File
@@ -31,6 +31,26 @@ class IpcqInvalidDirection(ValueError):
has no neighbor installed for this PE."""
# ── ADR-0023 D9.7: IPCQ slot-memory latency model ───────────────────
#
# Per-tier (bw_gbs, overhead_ns) used to charge the slot write (inbound)
# and slot read (recv consume). Mirrors topology.yaml component values.
_BUFFER_KIND_BW: dict[str, tuple[float, float]] = {
"tcm": (512.0, 0.0),
"sram": (128.0, 2.0),
"hbm": (32.0, 6.0),
}
def slot_io_latency_ns(buffer_kind: str, nbytes: int) -> float:
"""Per-access latency for one slot read/write of ``nbytes`` against
the IPCQ backing memory tier (``buffer_kind``)."""
bw_gbs, overhead_ns = _BUFFER_KIND_BW.get(
buffer_kind, _BUFFER_KIND_BW["tcm"],
)
return float(nbytes) / bw_gbs + overhead_ns
# ── D2.5: IpcqEndpoint ───────────────────────────────────────────────
@@ -219,6 +219,16 @@ class PeDmaComponent(PeEngineBase):
token = txn.request
# ADR-0023 D9.7: charge IPCQ slot-WRITE latency against the
# backing-memory tier (tcm/sram/hbm) before the atomic block.
# Must come BEFORE the atomic write→IpcqMetaArrival pair (I6).
from kernbench.common.ipcq_types import slot_io_latency_ns
slot_write_ns = slot_io_latency_ns(
token.dst_endpoint.buffer_kind, token.nbytes,
)
if slot_write_ns > 0:
yield env.timeout(slot_write_ns)
# ── ATOMIC: do not introduce yield between these two operations ──
# 1. Move data via MemoryStore (single-hop DMA write).
# Prefer the in-flight snapshot stashed by the sender PE_DMA;
@@ -329,6 +329,16 @@ class PeIpcqComponent(ComponentBase):
qp["my_tail"] += 1
# ADR-0023 D9.7: charge IPCQ slot-READ latency against the
# backing-memory tier (tcm/sram/hbm). Recv blocks for the
# kernel-side slot consume; pe_exec_ns reflects this cost.
from kernbench.common.ipcq_types import slot_io_latency_ns
slot_read_ns = slot_io_latency_ns(
self._buffer_kind, req.result_data.get("nbytes", 0),
)
if slot_read_ns > 0:
yield env.timeout(slot_read_ns)
# Diagnostics trace (D14)
from kernbench.ccl import diagnostics
if diagnostics.trace_enabled():
Binary file not shown.

Before

Width:  |  Height:  |  Size: 41 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 87 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 39 KiB

-37
View File
@@ -1,37 +0,0 @@
algorithm,sip_topology,n_sips,n_elem,bytes_per_pe,bytes_per_sip,latency_ns
intercube_allreduce,mesh_2d_no_wrap,6,8,16,256,3508.4249999999993
intercube_allreduce,mesh_2d_no_wrap,6,32,64,1024,3515.55
intercube_allreduce,mesh_2d_no_wrap,6,64,128,2048,3525.0499999999975
intercube_allreduce,mesh_2d_no_wrap,6,128,256,4096,3544.049999999992
intercube_allreduce,mesh_2d_no_wrap,6,512,1024,16384,3667.049999999992
intercube_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3837.049999999992
intercube_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,4177.049999999992
intercube_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,4857.049999999959
intercube_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,6217.049999999945
intercube_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,8937.049999999937
intercube_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,14377.049999999872
intercube_allreduce,mesh_2d_no_wrap,6,49152,98304,1572864,19817.049999999872
intercube_allreduce,ring_1d,6,8,16,256,3073.1299999999937
intercube_allreduce,ring_1d,6,32,64,1024,3079.8799999999947
intercube_allreduce,ring_1d,6,64,128,2048,3088.879999999992
intercube_allreduce,ring_1d,6,128,256,4096,3106.8799999999865
intercube_allreduce,ring_1d,6,512,1024,16384,3225.8799999999865
intercube_allreduce,ring_1d,6,1024,2048,32768,3391.8799999999865
intercube_allreduce,ring_1d,6,2048,4096,65536,3723.8799999999865
intercube_allreduce,ring_1d,6,4096,8192,131072,4387.879999999965
intercube_allreduce,ring_1d,6,8192,16384,262144,5715.879999999957
intercube_allreduce,ring_1d,6,16384,32768,524288,8371.879999999932
intercube_allreduce,ring_1d,6,32768,65536,1048576,13683.879999999903
intercube_allreduce,ring_1d,6,49152,98304,1572864,18995.879999999917
intercube_allreduce,torus_2d,6,8,16,256,2190.4799999999923
intercube_allreduce,torus_2d,6,32,64,1024,2196.479999999993
intercube_allreduce,torus_2d,6,64,128,2048,2204.4799999999905
intercube_allreduce,torus_2d,6,128,256,4096,2220.479999999985
intercube_allreduce,torus_2d,6,512,1024,16384,2325.479999999985
intercube_allreduce,torus_2d,6,1024,2048,32768,2471.479999999985
intercube_allreduce,torus_2d,6,2048,4096,65536,2763.479999999985
intercube_allreduce,torus_2d,6,4096,8192,131072,3347.4799999999777
intercube_allreduce,torus_2d,6,8192,16384,262144,4515.4799999999705
intercube_allreduce,torus_2d,6,16384,32768,524288,6851.479999999952
intercube_allreduce,torus_2d,6,32768,65536,1048576,11523.479999999923
intercube_allreduce,torus_2d,6,49152,98304,1572864,16195.479999999952
1 algorithm sip_topology n_sips n_elem bytes_per_pe bytes_per_sip latency_ns
2 intercube_allreduce mesh_2d_no_wrap 6 8 16 256 3508.4249999999993
3 intercube_allreduce mesh_2d_no_wrap 6 32 64 1024 3515.55
4 intercube_allreduce mesh_2d_no_wrap 6 64 128 2048 3525.0499999999975
5 intercube_allreduce mesh_2d_no_wrap 6 128 256 4096 3544.049999999992
6 intercube_allreduce mesh_2d_no_wrap 6 512 1024 16384 3667.049999999992
7 intercube_allreduce mesh_2d_no_wrap 6 1024 2048 32768 3837.049999999992
8 intercube_allreduce mesh_2d_no_wrap 6 2048 4096 65536 4177.049999999992
9 intercube_allreduce mesh_2d_no_wrap 6 4096 8192 131072 4857.049999999959
10 intercube_allreduce mesh_2d_no_wrap 6 8192 16384 262144 6217.049999999945
11 intercube_allreduce mesh_2d_no_wrap 6 16384 32768 524288 8937.049999999937
12 intercube_allreduce mesh_2d_no_wrap 6 32768 65536 1048576 14377.049999999872
13 intercube_allreduce mesh_2d_no_wrap 6 49152 98304 1572864 19817.049999999872
14 intercube_allreduce ring_1d 6 8 16 256 3073.1299999999937
15 intercube_allreduce ring_1d 6 32 64 1024 3079.8799999999947
16 intercube_allreduce ring_1d 6 64 128 2048 3088.879999999992
17 intercube_allreduce ring_1d 6 128 256 4096 3106.8799999999865
18 intercube_allreduce ring_1d 6 512 1024 16384 3225.8799999999865
19 intercube_allreduce ring_1d 6 1024 2048 32768 3391.8799999999865
20 intercube_allreduce ring_1d 6 2048 4096 65536 3723.8799999999865
21 intercube_allreduce ring_1d 6 4096 8192 131072 4387.879999999965
22 intercube_allreduce ring_1d 6 8192 16384 262144 5715.879999999957
23 intercube_allreduce ring_1d 6 16384 32768 524288 8371.879999999932
24 intercube_allreduce ring_1d 6 32768 65536 1048576 13683.879999999903
25 intercube_allreduce ring_1d 6 49152 98304 1572864 18995.879999999917
26 intercube_allreduce torus_2d 6 8 16 256 2190.4799999999923
27 intercube_allreduce torus_2d 6 32 64 1024 2196.479999999993
28 intercube_allreduce torus_2d 6 64 128 2048 2204.4799999999905
29 intercube_allreduce torus_2d 6 128 256 4096 2220.479999999985
30 intercube_allreduce torus_2d 6 512 1024 16384 2325.479999999985
31 intercube_allreduce torus_2d 6 1024 2048 32768 2471.479999999985
32 intercube_allreduce torus_2d 6 2048 4096 65536 2763.479999999985
33 intercube_allreduce torus_2d 6 4096 8192 131072 3347.4799999999777
34 intercube_allreduce torus_2d 6 8192 16384 262144 4515.4799999999705
35 intercube_allreduce torus_2d 6 16384 32768 524288 6851.479999999952
36 intercube_allreduce torus_2d 6 32768 65536 1048576 11523.479999999923
37 intercube_allreduce torus_2d 6 49152 98304 1572864 16195.479999999952
Binary file not shown.

Before

Width:  |  Height:  |  Size: 39 KiB

+21 -17
View File
@@ -27,23 +27,27 @@ def pytest_sessionfinish(session, exitstatus):
import sys
from pathlib import Path
mod_path = Path(__file__).parent / "test_allreduce_multidevice.py"
if not mod_path.exists():
return
spec = importlib.util.spec_from_file_location(
"_test_allreduce_multidevice_for_aggregate", mod_path,
)
if spec is None or spec.loader is None:
return
mod = importlib.util.module_from_spec(spec)
sys.modules[spec.name] = mod
try:
spec.loader.exec_module(mod)
agg = getattr(mod, "_aggregate_sweep_plots", None)
if agg is not None:
agg()
except Exception as e:
print(f"[conftest] sweep aggregation failed: {e}")
def _exec(name: str, attr: str) -> None:
mod_path = Path(__file__).parent / name
if not mod_path.exists():
return
s = importlib.util.spec_from_file_location(
f"_{name.removesuffix('.py')}_for_aggregate", mod_path,
)
if s is None or s.loader is None:
return
mod = importlib.util.module_from_spec(s)
sys.modules[s.name] = mod
try:
s.loader.exec_module(mod)
fn = getattr(mod, attr, None)
if fn is not None:
fn()
except Exception as e:
print(f"[conftest] aggregator {attr}() in {name} failed: {e}")
_exec("test_allreduce_multidevice.py", "_aggregate_sweep_plots")
_exec("test_allreduce_buffer_kind_sweep.py", "aggregate_buffer_kind_plot")
@pytest.fixture(scope="session")
Binary file not shown.

Before

Width:  |  Height:  |  Size: 48 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 48 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 51 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 50 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 100 KiB

+196
View File
@@ -0,0 +1,196 @@
"""Phase 1 buffer-kind allreduce sweep — torus_2d 6 SIPs.
Parametrized over (buffer_kind, n_elem). Each case runs the standard
config-driven allreduce app and writes a JSON row to a shared staging
dir; the conftest sessionfinish hook (added in Phase 1) aggregates
rows into ``docs/diagrams/allreduce_latency_plots/buffer_kind_sweep.png``.
Pre-Phase-2: the three buffer-kind lines overlap exactly because slot
access is latency-free today. Post-Phase-2 they spread out (tcm
fastest, hbm slowest).
"""
from __future__ import annotations
import json
from pathlib import Path
import pytest
import yaml
from kernbench.runtime_api.context import RuntimeContext
from kernbench.runtime_api.types import DeviceSelector
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import resolve_topology
# Reuse the allreduce app helpers.
from tests.test_allreduce_multidevice import (
_write_temp_configs,
run_allreduce,
)
_BUFFER_KINDS = ["tcm", "sram", "hbm"]
_N_ELEM_GRID = [128, 1024, 8192, 32768] # 256 B → 64 KB per slot
_ELEM_BYTES_F16 = 2
_OUT_DIR = (Path(__file__).parent.parent / "docs" / "diagrams"
/ "allreduce_latency_plots")
_ROWS_DIR = _OUT_DIR / "_buffer_kind_rows"
def _bk_params():
out = []
for bk in _BUFFER_KINDS:
for n_elem in _N_ELEM_GRID:
out.append(pytest.param(bk, n_elem, id=f"{bk}-n_elem{n_elem}"))
return out
@pytest.mark.parametrize("buffer_kind,n_elem", _bk_params())
def test_buffer_kind_allreduce_one(tmp_path, buffer_kind, n_elem):
"""One config of the buffer-kind sweep. xdist parallelizes."""
sub = tmp_path / f"{buffer_kind}_{n_elem}"
sub.mkdir()
topo_path, ccl_path = _write_temp_configs(
sub,
sip_topology="torus_2d",
n_sips=6,
algorithm="intercube_allreduce",
sip_w=3, sip_h=2,
n_elem_override=n_elem,
)
# Override buffer_kind in the temp ccl.yaml.
with open(ccl_path) as f:
ccl_cfg = yaml.safe_load(f)
ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind
ccl_cfg.setdefault("algorithms", {}).setdefault(
"intercube_allreduce", {},
)["buffer_kind"] = buffer_kind
with open(ccl_path, "w") as f:
yaml.dump(ccl_cfg, f, default_flow_style=False)
topo = resolve_topology(topo_path)
engine = GraphEngine(topo.topology_obj, enable_data=True)
spec = topo.topology_obj.spec
with RuntimeContext(
engine=engine,
target_device=DeviceSelector("all"),
correlation_id=f"bk_sweep_{buffer_kind}_{n_elem}",
spec=spec,
) as ctx:
result = run_allreduce(
ctx, engine, spec,
algorithm="intercube_allreduce", ccl_yaml=ccl_path,
)
assert result["ok_cubes"] > 0
pe_exec_vals = [
float(tr.get("pe_exec_ns", 0.0) or 0.0)
for _, (_, tr) in engine._results.items()
if isinstance(tr, dict)
]
crit_ns = max(pe_exec_vals) if pe_exec_vals else 0.0
bytes_per_pe = n_elem * _ELEM_BYTES_F16
record = {
"buffer_kind": buffer_kind,
"sip_topology": "torus_2d",
"n_sips": 6,
"n_elem": n_elem,
"bytes_per_pe": bytes_per_pe,
"latency_ns": crit_ns,
}
_ROWS_DIR.mkdir(parents=True, exist_ok=True)
row_path = _ROWS_DIR / f"{buffer_kind}_{n_elem}.json"
with open(row_path, "w", encoding="utf-8") as f:
json.dump(record, f)
def aggregate_buffer_kind_plot() -> bool:
"""Read per-config rows and emit buffer_kind_sweep.png + CSV.
Called from conftest.pytest_sessionfinish (controller-only).
Returns True if rows were aggregated.
"""
import csv
if not _ROWS_DIR.exists():
return False
row_files = sorted(_ROWS_DIR.glob("*.json"))
if not row_files:
return False
records = []
for p in row_files:
with open(p, encoding="utf-8") as f:
records.append(json.load(f))
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
def _fmt_bytes(x, _pos):
if x <= 0:
return "0"
if x >= 1024 * 1024:
return f"{x / (1024 * 1024):.0f} MB"
if x >= 1024:
return f"{x / 1024:.0f} KB"
return f"{x:.0f} B"
_bytes_fmt = FuncFormatter(_fmt_bytes)
_OUT_DIR.mkdir(parents=True, exist_ok=True)
with open(_OUT_DIR / "buffer_kind_sweep.csv", "w",
newline="", encoding="utf-8") as f:
w = csv.DictWriter(f, fieldnames=[
"buffer_kind", "sip_topology", "n_sips", "n_elem",
"bytes_per_pe", "latency_ns",
])
w.writeheader()
for r in sorted(records, key=lambda r: (
r["buffer_kind"], r["bytes_per_pe"],
)):
w.writerow(r)
colors = {"tcm": "tab:blue", "sram": "tab:orange", "hbm": "tab:red"}
fig, ax = plt.subplots(figsize=(10, 6))
for bk in ["tcm", "sram", "hbm"]:
rs = sorted(
[r for r in records if r["buffer_kind"] == bk],
key=lambda r: r["bytes_per_pe"],
)
if not rs:
continue
ax.plot(
[r["bytes_per_pe"] for r in rs],
[r["latency_ns"] for r in rs],
marker="o", lw=2.0,
color=colors[bk], label=f"buffer_kind = {bk}",
)
ax.set_xscale("log", base=2)
ax.set_xlabel("Bytes per PE (log scale)")
ax.set_ylabel("Time (ns)")
ax.set_title(
"Allreduce torus_2d (6 SIPs, 3×2) — IPCQ slot memory tier"
)
ax.grid(True, alpha=0.3)
ax.legend()
ax.xaxis.set_major_formatter(_bytes_fmt)
fig.tight_layout()
fig.savefig(_OUT_DIR / "buffer_kind_sweep.png", dpi=130)
plt.close(fig)
for p in row_files:
try:
p.unlink()
except OSError:
pass
try:
_ROWS_DIR.rmdir()
except OSError:
pass
print(f"\nWrote {_OUT_DIR / 'buffer_kind_sweep.png'} "
f"from {len(records)} rows")
return True
+40 -54
View File
@@ -289,7 +289,8 @@ _SWEEP_TOPOLOGIES = [
# parametrized invocation writes one JSON file here; the aggregator
# (run from conftest.pytest_sessionfinish) reads them and emits the
# combined CSV + PNG plots.
_SWEEP_OUT_DIR = Path(__file__).parent / "allreduce_latency_plots"
_SWEEP_OUT_DIR = (Path(__file__).parent.parent / "docs" / "diagrams"
/ "allreduce_latency_plots")
_SWEEP_ROWS_DIR = _SWEEP_OUT_DIR / "_rows"
@@ -447,7 +448,7 @@ def _aggregate_sweep_plots() -> bool:
ax.plot(xs, ys, marker="o", color="tab:blue")
ax.set_xscale("log", base=2)
ax.set_xlabel("Bytes per PE (log scale)")
ax.set_ylabel("max pe_exec_ns (critical path)")
ax.set_ylabel("Time (ns)")
ax.set_title(title)
ax.grid(True, alpha=0.3)
ax.xaxis.set_major_formatter(_bytes_fmt)
@@ -457,7 +458,28 @@ def _aggregate_sweep_plots() -> bool:
colors = {"ring_1d": "tab:blue", "torus_2d": "tab:orange",
"mesh_2d_no_wrap": "tab:green"}
THEORETICAL_TORUS_2D_6SIP_NS = 10600.0
# ── Hand-derived theoretical model for torus_2d (6 SIPs) ──
# Critical-path analysis (per packet, packet = 128 B at NoC):
# local intra-SIP reduce + broadcast = 8 hops × 57 ns = 456 ns
# global X-direction reduce = 5 UCIe + 1 UAL = 445 ns
# global Y-direction reduce = 5 UCIe + 1 UAL = 445 ns
# per-packet startup latency = 456 + 445 + 445 = 1346 ns
# Packet count is PER CUBE (8 PEs/cube cooperate on the cube tile).
# At 6144 packets/cube the pipelined total is 8741 ns, so the
# bottleneck-stage interval τ = (8741 1346) / (6144 1) ≈ 1.204 ns.
# T_theoretical(N) = 1346 + (N 1) × τ
# where N = ceil((bytes_per_pe × 8) / 128) = ceil(bytes_per_pe / 16)
NOC_PACKET_BYTES = 128
PES_PER_CUBE = 8
T_STARTUP_NS = 1346.0
TAU_NS = (8741.0 - 1346.0) / (6144 - 1) # ≈ 1.2038 ns/packet
def _theoretical_torus_2d_ns(bytes_per_pe: int) -> float:
bytes_per_cube = int(bytes_per_pe) * PES_PER_CUBE
n_packets = max(1, -(-bytes_per_cube // NOC_PACKET_BYTES)) # ceil
return T_STARTUP_NS + (n_packets - 1) * TAU_NS
fig, ax = plt.subplots(figsize=(9, 6))
for topo_name in topologies:
rs = sorted(
@@ -473,64 +495,28 @@ def _aggregate_sweep_plots() -> bool:
label=f"{topo_name} (n_sips={rs[0]['n_sips']})",
color=colors.get(topo_name),
)
ax.axhline(
y=THEORETICAL_TORUS_2D_6SIP_NS,
color="tab:red", linestyle="--", linewidth=1.5,
label=f"theoretical torus_2d (6 SIPs) = "
f"{THEORETICAL_TORUS_2D_6SIP_NS:.0f} ns",
# Theoretical torus_2d curve across all payload sizes.
torus_rs = sorted(
[r for r in records if r["sip_topology"] == "torus_2d"],
key=lambda r: r["bytes_per_pe"],
)
BYTES_96KB = 96 * 1024
ax.axvline(
x=BYTES_96KB, ymin=0, ymax=1,
color="tab:red", linestyle=":", linewidth=1.2,
)
ax.plot(
[BYTES_96KB], [THEORETICAL_TORUS_2D_6SIP_NS],
marker="x", color="tab:red", markersize=10, markeredgewidth=2,
)
# Find simulated torus_2d latency at 96 KB (if present) for direct
# comparison with the theoretical value.
sim_torus_at_96kb = next(
(r["latency_ns"] for r in records
if r["sip_topology"] == "torus_2d" and r["bytes_per_pe"] == BYTES_96KB),
None,
)
if sim_torus_at_96kb is not None:
if torus_rs:
xs_th = [r["bytes_per_pe"] for r in torus_rs]
ys_th = [_theoretical_torus_2d_ns(r["bytes_per_pe"]) for r in torus_rs]
ax.plot(
[BYTES_96KB], [sim_torus_at_96kb],
marker="o", color="tab:orange",
markersize=10, markeredgecolor="black", markeredgewidth=1.2,
)
ax.annotate(
f"96 KB\n"
f"theoretical = {THEORETICAL_TORUS_2D_6SIP_NS:.0f} ns\n"
f"simulated = {sim_torus_at_96kb:.0f} ns",
xy=(BYTES_96KB, sim_torus_at_96kb),
xytext=(10, -20), textcoords="offset points",
color="tab:red", fontsize=9,
)
else:
ax.annotate(
f"96 KB\n→ theoretical {THEORETICAL_TORUS_2D_6SIP_NS:.0f} ns",
xy=(BYTES_96KB, THEORETICAL_TORUS_2D_6SIP_NS),
xytext=(8, -20), textcoords="offset points",
color="tab:red", fontsize=9,
xs_th, ys_th,
color="tab:red", linestyle="--", linewidth=1.6, marker="x",
label="theoretical torus_2d (6 SIPs)",
)
ax.set_xscale("log", base=2)
ax.set_xlabel("Bytes per PE (log scale)")
ax.set_ylabel("max pe_exec_ns (critical path)")
ax.set_ylabel("Time (ns)")
ax.set_title("Multi-device allreduce latency by topology")
ax.grid(True, alpha=0.3)
# Drop 128 KB tick (overlaps visually with the explicit 96 KB marker)
# and add 96 KB.
BYTES_128KB = 128 * 1024
existing_ticks = [t for t in ax.get_xticks() if int(t) != BYTES_128KB]
if BYTES_96KB not in existing_ticks:
existing_ticks.append(BYTES_96KB)
ax.set_xticks(sorted(existing_ticks))
ax.set_xlim(left=min(r["bytes_per_pe"] for r in records) / 2,
right=BYTES_96KB * 1.5)
right=max(r["bytes_per_pe"] for r in records) * 1.5)
ax.legend()
ax.xaxis.set_major_formatter(_bytes_fmt)
fig.tight_layout()
@@ -811,7 +797,7 @@ def _draw_cube_reduction(ax):
def emit_topology_diagram() -> str:
"""Emit a 2×2-panel topology diagram into allreduce_latency_plots/.
"""Emit a 2×2-panel topology diagram into docs/diagrams/allreduce_latency_plots/.
Top row: ring_1d | torus_2d (2×3)
Bot row: mesh_2d_no_wrap (2×3) | cube-level reduction in SIP 0
+622
View File
@@ -0,0 +1,622 @@
"""High-level IPCQ + SFR connection diagram (presentation only).
Renders ``docs/diagrams/ipcq_diagram_plots/ipcq_send_recv.png`` showing one
concrete example: SIP 0 / cube 0 / pe 0 sending to pe 1 in the
``intra_E`` direction. Boxes and arrows are grounded in the actual
code paths:
- PE_IPCQ SFR fields: src/kernbench/components/builtin/pe_ipcq.py
- SFR install: src/kernbench/ccl/install.py +
src/kernbench/ccl/sfr_config.py
- PE_DMA outbound /
inbound atomic write: src/kernbench/components/builtin/pe_dma.py
This is a pure-plotting test (no simulation). It exists so the diagram
can be regenerated reproducibly alongside the rest of the suite.
"""
from __future__ import annotations
from pathlib import Path
_OUT_DIR = (Path(__file__).parent.parent / "docs" / "diagrams"
/ "ipcq_diagram_plots")
# Color palette (matches the topology diagram for visual continuity).
_BG = "#fafbfd"
_FRAME = "#3a3f4a"
_TEXT = "#1f2530"
_BLUE = "#2c6fb6"
_GREEN = "#2e8a4e"
_ORANGE = "#d3722a"
_PURPLE = "#7a4cb6"
_BOX_FILL = "#eaf2fb"
_BOX_EDGE = "#2c4a78"
_HW_FILL = "#f3ecda"
_HW_EDGE = "#a07a2a"
_MEM_FILL = "#e8f3e8"
_MEM_EDGE = "#2e8a4e"
def _box(ax, x, y, w, h, title, lines, *, fill=_BOX_FILL, edge=_BOX_EDGE,
title_color=None, font=9):
from matplotlib.patches import FancyBboxPatch
box = FancyBboxPatch(
(x, y), w, h,
boxstyle="round,pad=0.04,rounding_size=0.18",
linewidth=1.6, edgecolor=edge, facecolor=fill, zorder=2,
)
ax.add_patch(box)
ax.text(x + w / 2, y + h - 0.45, title,
ha="center", va="top", fontsize=font + 1.5,
fontweight="bold",
color=title_color or edge, zorder=3)
for i, line in enumerate(lines):
ax.text(
x + 0.25, y + h - 1.1 - i * 0.45, line,
ha="left", va="top", fontsize=font - 0.5, color=_TEXT,
family="monospace", zorder=3,
)
def _arrow(ax, xy_from, xy_to, *, color=_BLUE, lw=1.8, curve=0.0,
style="-|>", alpha=1.0, zorder=4):
from matplotlib.patches import FancyArrowPatch
arrow = FancyArrowPatch(
xy_from, xy_to,
arrowstyle=style, mutation_scale=14,
color=color, lw=lw, alpha=alpha,
connectionstyle=f"arc3,rad={curve}",
zorder=zorder,
)
ax.add_patch(arrow)
def _step_label(ax, x, y, n, text, color=_BLUE):
from matplotlib.patches import Circle
ax.add_patch(Circle((x, y), 0.28, facecolor=color, edgecolor="white",
linewidth=1.4, zorder=5))
ax.text(x, y, str(n), ha="center", va="center", fontsize=9,
fontweight="bold", color="white", zorder=6)
ax.text(x + 0.45, y, text, ha="left", va="center", fontsize=9,
color=_TEXT, zorder=6)
def emit_ipcq_diagram() -> str:
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch, Rectangle
_OUT_DIR.mkdir(parents=True, exist_ok=True)
fig, ax = plt.subplots(figsize=(18, 11), facecolor="white")
ax.set_xlim(0, 22)
ax.set_ylim(0, 14)
ax.set_aspect("equal")
ax.axis("off")
ax.set_facecolor(_BG)
# Outer panel border.
border = FancyBboxPatch(
(0.15, 0.15), 21.7, 13.7,
boxstyle="round,pad=0.02,rounding_size=0.20",
linewidth=1.4, edgecolor=_FRAME, facecolor=_BG, zorder=0,
)
ax.add_patch(border)
ax.set_title(
"IPCQ — SFR state and send/recv path between pe0 and pe1 "
"(intra_E direction, SIP 0 / cube 0)",
fontsize=14, fontweight="bold", color=_TEXT, pad=12,
)
# ── pe0 side (left half) ────────────────────────────────────────
_box(
ax, x=0.8, y=8.4, w=8.4, h=5.0,
title="pe0.pe_ipcq (SFR — direction: intra_E)",
lines=[
"neighbor_table[intra_E]:",
" peer = sip0.cube0.pe1",
" peer.rx_base_pa → pe1's intra_W slot ring",
" my_rx_base_pa → pe0's intra_E slot ring",
" n_slots = 8 slot_size = 512 B",
"",
"head/tail counters (per direction):",
" my_head # ++ on tl.send",
" my_tail # ++ on tl.recv",
" peer_head_cache # updated on IpcqMetaArrival",
" peer_tail_cache # updated on IpcqCreditMetadata",
"",
"send blocks while (my_head peer_tail_cache) ≥ n_slots",
],
edge=_BOX_EDGE, fill=_BOX_FILL,
)
_box(
ax, x=0.8, y=4.5, w=8.4, h=2.7,
title="pe0.pe_dma (outbound IPCQ driver)",
lines=[
"_handle_ipcq_outbound():",
" • snapshot src bytes from MemoryStore",
" • find fabric path → pe1.pe_dma",
" • send Transaction; do NOT wait (fire-and-forget)",
],
edge=_HW_EDGE, fill=_HW_FILL,
)
# ── pe1 side (right half) ───────────────────────────────────────
_box(
ax, x=12.8, y=8.4, w=8.4, h=5.0,
title="pe1.pe_ipcq (SFR — direction: intra_W)",
lines=[
"neighbor_table[intra_W]:",
" peer = sip0.cube0.pe0",
" peer.rx_base_pa → pe0's intra_E slot ring",
" my_rx_base_pa → pe1's intra_W slot ring",
" n_slots = 8 slot_size = 512 B",
"",
"head/tail counters (per direction):",
" my_head # ++ on tl.send (other direction)",
" my_tail # ++ on tl.recv (this direction)",
" peer_head_cache # updated on IpcqMetaArrival",
" peer_tail_cache # updated on IpcqCreditMetadata",
"",
"recv blocks while peer_head_cache ≤ my_tail",
],
edge=_BOX_EDGE, fill=_BOX_FILL,
)
_box(
ax, x=12.8, y=4.5, w=8.4, h=2.7,
title="pe1.pe_dma (inbound IPCQ driver)",
lines=[
"_handle_ipcq_inbound():",
" • pay terminal drain over fabric BW",
" • atomic: write data into pe1's intra_W slot",
" • forward IpcqMetaArrival → pe1.pe_ipcq",
],
edge=_HW_EDGE, fill=_HW_FILL,
)
# ── Slot ring buffer (under pe1.pe_dma) ─────────────────────────
ring_x0, ring_y0 = 12.8, 1.1
ring_w, ring_h = 8.4, 2.6
box = FancyBboxPatch(
(ring_x0, ring_y0), ring_w, ring_h,
boxstyle="round,pad=0.04,rounding_size=0.16",
linewidth=1.6, edgecolor=_MEM_EDGE, facecolor=_MEM_FILL, zorder=2,
)
ax.add_patch(box)
ax.text(
ring_x0 + ring_w / 2, ring_y0 + ring_h - 0.42,
"MemoryStore[buffer_kind] pe1's intra_W slot ring "
"(n_slots = 8, slot_size = 512 B)",
ha="center", va="top", fontsize=10, fontweight="bold",
color=_MEM_EDGE, zorder=3,
)
# 8 slots laid out horizontally inside the ring panel.
n_slots = 8
pad = 0.35
slot_w = (ring_w - 2 * pad) / n_slots
slot_h = 0.85
slot_y = ring_y0 + 0.3
for i in range(n_slots):
sx = ring_x0 + pad + i * slot_w
is_active = (i == 3) # Highlight one example slot
face = "#ffd9b8" if is_active else "white"
edge = _ORANGE if is_active else _MEM_EDGE
rect = Rectangle(
(sx + 0.05, slot_y), slot_w - 0.10, slot_h,
linewidth=1.2, facecolor=face, edgecolor=edge, zorder=3,
)
ax.add_patch(rect)
ax.text(
sx + slot_w / 2, slot_y + slot_h / 2,
f"s{i}", ha="center", va="center", fontsize=9,
color=_ORANGE if is_active else _TEXT,
fontweight="bold" if is_active else "normal", zorder=4,
)
ax.text(
ring_x0 + pad + 3 * slot_w + slot_w / 2, slot_y - 0.30,
"slot_idx = my_head % n_slots",
ha="center", va="top", fontsize=8, style="italic",
color=_ORANGE,
)
# ── Fabric label (between pe0.pe_dma and pe1.pe_dma) ────────────
fab = FancyBboxPatch(
(9.6, 5.0), 2.6, 1.7,
boxstyle="round,pad=0.04,rounding_size=0.20",
linewidth=1.4, edgecolor=_PURPLE, facecolor="white", zorder=2,
)
ax.add_patch(fab)
ax.text(10.9, 6.4, "Fabric", ha="center", va="center",
fontsize=11, fontweight="bold", color=_PURPLE)
ax.text(10.9, 5.7, "(NoC routers,\npe_dma → pe_dma)",
ha="center", va="center", fontsize=8, color=_TEXT)
# ── Arrows + step labels ────────────────────────────────────────
# 1. tl.send ↘ pe0.pe_ipcq
_arrow(ax, (9.2, 12.9), (9.7, 12.9), color=_BLUE) # placeholder so number lands
_step_label(ax, 0.5, 13.6,
1, "kernel calls tl.send(dir='intra_E', src_addr=X)",
color=_BLUE)
# 2. pe0.pe_ipcq → pe0.pe_dma (IpcqDmaToken)
_arrow(ax, (5.0, 8.4), (5.0, 7.2), color=_BLUE, lw=2.0)
ax.text(5.2, 7.85, "IpcqDmaToken\n"
"dst = peer.rx_base_pa + slot_idx*512",
ha="left", va="center", fontsize=8, color=_BLUE,
family="monospace")
# 3. pe0.pe_dma → fabric → pe1.pe_dma (data, fire-and-forget)
_arrow(ax, (9.2, 5.85), (9.6, 5.85), color=_BLUE, lw=2.0)
_arrow(ax, (12.2, 5.85), (12.8, 5.85), color=_BLUE, lw=2.0)
ax.text(10.9, 4.7, "data (fire-and-forget)",
ha="center", va="center", fontsize=8, style="italic",
color=_BLUE)
# 4. pe1.pe_dma → MemoryStore slot (atomic)
_arrow(ax, (17.0, 4.5), (17.0, 3.7), color=_GREEN, lw=2.0)
ax.text(17.2, 4.10, "atomic write",
ha="left", va="center", fontsize=8, color=_GREEN,
family="monospace")
# 5. pe1.pe_dma → pe1.pe_ipcq (IpcqMetaArrival)
_arrow(ax, (15.0, 7.2), (15.0, 8.4), color=_GREEN, lw=2.0)
ax.text(13.0, 7.85, "IpcqMetaArrival\n"
"→ peer_head_cache update",
ha="left", va="center", fontsize=8, color=_GREEN,
family="monospace")
# 6. tl.recv unblocks (annotation only)
_step_label(ax, 12.85, 13.6,
6, "tl.recv(dir='intra_W') unblocks; consume slot; my_tail++",
color=_GREEN)
# 7. pe1.pe_ipcq → pe0.pe_ipcq (IpcqCreditMetadata, fast-path SimPy Store)
_arrow(ax, (12.8, 11.0), (9.2, 11.0),
color=_ORANGE, lw=2.0, curve=0.18)
ax.text(11.0, 11.55,
"IpcqCreditMetadata (consumer_seq, dst_rx_base_pa)\n"
"→ pe0's credit_inbox (SimPy Store, no fabric)",
ha="center", va="center", fontsize=8, color=_ORANGE,
family="monospace")
# 8. pe0.peer_tail_cache update unblocks tl.send
ax.text(0.5, 0.55,
"Steps 13 = data path (fabric, fire-and-forget); "
"46 = receiver wake-up; 7 = credit return (fast path); "
"8 = sender unblocks when peer_tail_cache catches up.",
ha="left", va="center", fontsize=9, color=_TEXT,
style="italic")
# In-figure step legend (top, between pe0/pe1 panels).
legend_x = 9.4
legend_y = 13.5
_step_label(ax, legend_x, legend_y, 2,
"PE_IPCQ → PE_DMA (token)", color=_BLUE)
_step_label(ax, legend_x, legend_y - 0.45, 3,
"PE_DMA → fabric → PE_DMA (data)", color=_BLUE)
_step_label(ax, legend_x, legend_y - 0.90, 4,
"atomic slot write", color=_GREEN)
_step_label(ax, legend_x, legend_y - 1.35, 5,
"IpcqMetaArrival", color=_GREEN)
_step_label(ax, legend_x, legend_y - 1.80, 7,
"IpcqCreditMetadata", color=_ORANGE)
out_path = _OUT_DIR / "ipcq_send_recv.png"
fig.savefig(out_path, dpi=130, bbox_inches="tight",
facecolor=fig.get_facecolor())
import matplotlib.pyplot as _plt
_plt.close(fig)
return str(out_path)
def test_emit_ipcq_diagram():
out = emit_ipcq_diagram()
assert Path(out).exists()
# ── 2nd diagram: two-PE data + DMA + IPCQ-memory layout ──────────────
def _pe_panel(ax, x0, y0, w, h, label, *, edge=_FRAME, fill="white"):
"""Outer container for one PE: title bar + body."""
from matplotlib.patches import FancyBboxPatch
box = FancyBboxPatch(
(x0, y0), w, h,
boxstyle="round,pad=0.04,rounding_size=0.20",
linewidth=1.8, edgecolor=edge, facecolor=fill, zorder=1,
)
ax.add_patch(box)
# Title band
title_h = 0.55
band = FancyBboxPatch(
(x0 + 0.12, y0 + h - title_h - 0.10), w - 0.24, title_h,
boxstyle="round,pad=0.02,rounding_size=0.10",
linewidth=0, edgecolor="none", facecolor=edge, zorder=2,
)
ax.add_patch(band)
ax.text(
x0 + w / 2, y0 + h - title_h / 2 - 0.10, label,
ha="center", va="center", fontsize=12, fontweight="bold",
color="white", zorder=3,
)
def _sub_block(ax, cx, cy, w, h, title, body_lines, *,
fill, edge, font=9):
from matplotlib.patches import FancyBboxPatch
rect = FancyBboxPatch(
(cx - w / 2, cy - h / 2), w, h,
boxstyle="round,pad=0.02,rounding_size=0.10",
linewidth=1.4, edgecolor=edge, facecolor=fill, zorder=3,
)
ax.add_patch(rect)
ax.text(cx, cy + h / 2 - 0.30, title, ha="center", va="top",
fontsize=font + 1, fontweight="bold", color=edge, zorder=4)
for i, line in enumerate(body_lines):
ax.text(
cx, cy + h / 2 - 0.75 - i * 0.34, line,
ha="center", va="top", fontsize=font - 0.5, color=_TEXT,
family="monospace", zorder=4,
)
def _tcm_with_slots(ax, cx, cy, w, h, *, n_slots=8, active_slot=3,
title="PE_TCM (local memory)"):
"""Draw a TCM box that contains a source buffer + IPCQ slot ring."""
from matplotlib.patches import FancyBboxPatch, Rectangle
rect = FancyBboxPatch(
(cx - w / 2, cy - h / 2), w, h,
boxstyle="round,pad=0.02,rounding_size=0.10",
linewidth=1.4, edgecolor=_MEM_EDGE, facecolor=_MEM_FILL, zorder=3,
)
ax.add_patch(rect)
ax.text(
cx, cy + h / 2 - 0.28, title, ha="center", va="top",
fontsize=9.5, fontweight="bold", color=_MEM_EDGE, zorder=4,
)
# Source buffer region (left part).
src_w = (w - 0.6) * 0.30
src_h = h - 1.20
sx = cx - w / 2 + 0.20
sy = cy - h / 2 + 0.20
src_rect = Rectangle(
(sx, sy), src_w, src_h,
linewidth=1.0, facecolor="white", edgecolor=_BLUE, zorder=4,
)
ax.add_patch(src_rect)
ax.text(sx + src_w / 2, sy + src_h / 2 + 0.18, "source",
ha="center", va="center", fontsize=8.5, color=_BLUE,
fontweight="bold", zorder=5)
ax.text(sx + src_w / 2, sy + src_h / 2 - 0.18, "buffer",
ha="center", va="center", fontsize=8.5, color=_BLUE,
fontweight="bold", zorder=5)
# Slot ring region (right part).
ring_x0 = sx + src_w + 0.30
ring_w = (cx + w / 2 - 0.20) - ring_x0
ring_y0 = sy
ring_h = src_h
ring_rect = Rectangle(
(ring_x0, ring_y0), ring_w, ring_h,
linewidth=1.0, facecolor="white", edgecolor=_ORANGE, zorder=4,
)
ax.add_patch(ring_rect)
ax.text(
ring_x0 + ring_w / 2, ring_y0 + ring_h - 0.18,
"IPCQ slot ring (intra_W)",
ha="center", va="top", fontsize=8.5, color=_ORANGE,
fontweight="bold", zorder=5,
)
# Draw 8 slots in a 2×4 grid.
cols = 4
rows = 2
slot_inner_pad = 0.12
sw = (ring_w - (cols + 1) * slot_inner_pad) / cols
sh = (ring_h - 0.65 - (rows + 1) * slot_inner_pad) / rows
for i in range(n_slots):
r = i // cols
c = i % cols
sx_i = ring_x0 + slot_inner_pad + c * (sw + slot_inner_pad)
sy_i = (ring_y0 + slot_inner_pad
+ (rows - 1 - r) * (sh + slot_inner_pad))
is_active = (i == active_slot)
face = "#ffd9b8" if is_active else "white"
edge = _ORANGE if is_active else "#c9c9c9"
ax.add_patch(Rectangle(
(sx_i, sy_i), sw, sh,
linewidth=1.0, facecolor=face, edgecolor=edge, zorder=5,
))
ax.text(
sx_i + sw / 2, sy_i + sh / 2, f"s{i}",
ha="center", va="center", fontsize=8,
fontweight="bold" if is_active else "normal",
color=_ORANGE if is_active else "#666",
zorder=6,
)
def emit_ipcq_dma_diagram() -> str:
"""Two-PE diagram emphasising: outbound DMA writes DIRECTLY into the
receiver's local memory (slot ring in PE_TCM). pe1.pe_dma is the
inbound memory port that pays drain + emits the MetaArrival notice;
the actual DMA payload terminates in the slot, not in another DMA.
"""
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch
_OUT_DIR.mkdir(parents=True, exist_ok=True)
fig, ax = plt.subplots(figsize=(22, 12), facecolor="white")
XMAX, YMAX = 28.0, 14.0
ax.set_xlim(0, XMAX)
ax.set_ylim(0, YMAX)
ax.set_aspect("equal")
ax.axis("off")
ax.set_facecolor(_BG)
# Outer page border.
ax.add_patch(FancyBboxPatch(
(0.20, 0.20), XMAX - 0.40, YMAX - 0.40,
boxstyle="round,pad=0.02,rounding_size=0.20",
linewidth=1.4, edgecolor=_FRAME, facecolor=_BG, zorder=0,
))
ax.set_title(
"Two PEs over IPCQ — outbound DMA lands DIRECTLY in receiver "
"memory (slot ring in PE_TCM)",
fontsize=14, fontweight="bold", color=_TEXT, pad=12,
)
# ── PE panels ───────────────────────────────────────────────────
PE0_X, PE0_W = 0.8, 11.6
PE1_X, PE1_W = 15.6, 11.6
PE_Y, PE_H = 1.6, 10.4
_pe_panel(ax, x0=PE0_X, y0=PE_Y, w=PE0_W, h=PE_H,
label="PE 0 (sender — sip0.cube0.pe0)",
edge=_BLUE, fill="white")
_pe_panel(ax, x0=PE1_X, y0=PE_Y, w=PE1_W, h=PE_H,
label="PE 1 (receiver — sip0.cube0.pe1)",
edge=_GREEN, fill="white")
# ── PE 0 sub-blocks ─────────────────────────────────────────────
# Top row: PE_CPU and PE_IPCQ
_sub_block(
ax, cx=PE0_X + 2.5, cy=10.3, w=3.4, h=1.6,
title="PE_CPU",
body_lines=["kernel:",
" tl.send(dir='intra_E',",
" src=ptr)"],
fill=_BOX_FILL, edge=_BOX_EDGE,
)
_sub_block(
ax, cx=PE0_X + 8.4, cy=10.3, w=4.0, h=1.6,
title="PE_IPCQ (control / SFR)",
body_lines=["per-direction state:",
" head/tail, peer.rx_base_pa,",
" peer_tail_cache"],
fill=_BOX_FILL, edge=_BOX_EDGE,
)
# Mid: PE_TCM (left, with src + slot ring) and PE_DMA outbound (right)
_tcm_with_slots(
ax, cx=PE0_X + 3.0, cy=5.4, w=5.6, h=3.6,
n_slots=8, active_slot=-1,
title="PE_TCM (local memory · buffer_kind = tcm)",
)
_sub_block(
ax, cx=PE0_X + 8.6, cy=5.4, w=3.6, h=3.6,
title="PE_DMA (outbound)",
body_lines=["snapshot src bytes",
" from PE_TCM",
"build Transaction",
" (dst = peer's slot PA)",
"fire onto fabric;",
" do not wait for ack"],
fill=_HW_FILL, edge=_HW_EDGE,
)
# Arrows on PE 0 side
_arrow(ax, (PE0_X + 4.20, 10.3), (PE0_X + 6.40, 10.3),
color=_BLUE, lw=1.7)
ax.text(PE0_X + 5.30, 10.65, "tl.send",
ha="center", va="center", fontsize=8.5, color=_BLUE,
fontweight="bold")
# PE_IPCQ → PE_DMA control (kept; label removed per request)
_arrow(ax, (PE0_X + 8.4, 9.50), (PE0_X + 8.6, 7.20),
color=_ORANGE, lw=1.6)
# PE_TCM(src) → PE_DMA (read source data)
_arrow(ax, (PE0_X + 5.80, 5.40), (PE0_X + 6.80, 5.40),
color=_BLUE, lw=2.0)
ax.text(PE0_X + 6.30, 6.05, "read source\n(snapshot)",
ha="center", va="bottom", fontsize=7.5, color=_BLUE,
family="monospace")
# ── Fabric in the middle ────────────────────────────────────────
FAB_X0, FAB_W = 12.6, 2.8
FAB_Y0, FAB_H = 4.6, 2.2
ax.add_patch(FancyBboxPatch(
(FAB_X0, FAB_Y0), FAB_W, FAB_H,
boxstyle="round,pad=0.04,rounding_size=0.20",
linewidth=1.6, edgecolor=_PURPLE, facecolor="white", zorder=2,
))
ax.text(FAB_X0 + FAB_W / 2, FAB_Y0 + FAB_H - 0.45,
"NoC Fabric", ha="center", va="center",
fontsize=12, fontweight="bold", color=_PURPLE)
ax.text(FAB_X0 + FAB_W / 2, FAB_Y0 + 0.55,
"(routers, links;\nfabric BW + drain time)",
ha="center", va="center", fontsize=8.5, color=_TEXT)
# ── PE 1 sub-blocks ─────────────────────────────────────────────
# Top row: PE_IPCQ and PE_CPU
_sub_block(
ax, cx=PE1_X + 3.2, cy=10.3, w=4.0, h=1.6,
title="PE_IPCQ (control / SFR)",
body_lines=["per-direction state:",
" head/tail, peer_head_cache,",
" my_rx_base_pa"],
fill=_BOX_FILL, edge=_BOX_EDGE,
)
_sub_block(
ax, cx=PE1_X + 9.1, cy=10.3, w=3.4, h=1.6,
title="PE_CPU",
body_lines=["kernel:",
" ptr = tl.recv(",
" dir='intra_W')"],
fill=_BOX_FILL, edge=_BOX_EDGE,
)
# Wide PE_TCM occupying the centre-bottom of PE 1 — the DMA payload
# terminates HERE (not in any DMA component).
_tcm_with_slots(
ax, cx=PE1_X + 5.0, cy=5.4, w=8.4, h=3.6,
n_slots=8, active_slot=3,
title="PE_TCM (local memory · buffer_kind = tcm)",
)
# ── DATA arrows: outbound DMA ──► RECEIVER MEMORY (the slot) ───
# The inbound PE_DMA is NOT on the data path — it's a sim-side
# bookkeeper that pays terminal drain + emits MetaArrival. The
# actual DMA payload jumps fabric → slot directly.
# 1) pe0.PE_DMA → fabric
_arrow(ax, (PE0_X + 10.40, 5.40), (FAB_X0, 5.40),
color=_BLUE, lw=2.8)
# 2) fabric → PE_TCM slot s3 (DMA payload terminates IN MEMORY)
SLOT_X = PE1_X + 2.95 # x-coordinate of slot s3 within PE_TCM
_arrow(ax, (FAB_X0 + FAB_W, 5.40), (SLOT_X, 5.40),
color=_BLUE, lw=2.8)
# PE_IPCQ → PE_CPU: tl.recv unblocks
_arrow(ax, (PE1_X + 5.20, 10.30), (PE1_X + 7.40, 10.30),
color=_GREEN, lw=1.7)
ax.text(PE1_X + 6.30, 10.65, "unblock tl.recv",
ha="center", va="center", fontsize=8.5, color=_GREEN,
fontweight="bold")
# PE_CPU → PE_TCM: kernel reads consumed slot via returned ptr
_arrow(ax, (PE1_X + 9.10, 9.50), (PE1_X + 8.10, 7.20),
color=_GREEN, lw=1.4, curve=0.10)
ax.text(PE1_X + 9.30, 8.30, "kernel reads\nslot data",
ha="left", va="center", fontsize=7.5, color=_GREEN)
# (Credit-return arrow + label removed per request — see code
# for the actual mechanism: pe1.pe_ipcq → pe0.credit_inbox via
# SimPy Store after env.timeout(fabric_path_latency_ns).)
# ── Footer legend ──────────────────────────────────────────────
ax.text(0.6, 0.85,
"DATA (blue) : pe0 PE_TCM[src] → pe0 PE_DMA → "
"NoC fabric → pe1 PE_TCM[slot s3] ← DMA write "
"terminates IN MEMORY",
ha="left", va="center", fontsize=9, color=_TEXT,
style="italic")
ax.text(0.6, 0.45,
"CTRL (orange) : PE_IPCQ issues IpcqDmaToken on send; "
"pe1's inbound port emits MetaArrival; credit return "
"uses the fabric path (timing) but bypasses the per-hop "
"component graph (D9 fast path).",
ha="left", va="center", fontsize=9, color=_TEXT,
style="italic")
out_path = _OUT_DIR / "ipcq_two_pe_dma.png"
fig.savefig(out_path, dpi=130, bbox_inches="tight",
facecolor=fig.get_facecolor())
plt.close(fig)
return str(out_path)
def test_emit_ipcq_dma_diagram():
out = emit_ipcq_dma_diagram()
assert Path(out).exists()
+139
View File
@@ -0,0 +1,139 @@
"""Phase 1 test for moving the intercube_allreduce root cube from the
bottom-right corner (3,3) to the geometric center (2,2).
Today's algorithm (intercube_allreduce.py) hardcodes
``root_cube = (cube_h-1) * cube_w + (cube_w-1)`` (= cube 15 in 4×4).
The intra-SIP critical path for one allreduce is therefore::
Phase 1 (row reduce W→E to col 3) : 3 hops
Phase 2 (col reduce N→S to row 3 on col 3): 3 hops
Phase 3 (inter-SIP at root) : (separate)
Phase 4 (col broadcast S→N) : 3 hops
Phase 5 (row broadcast E→W) : 3 hops
Total intra-SIP critical path : 12 hops
Moving the root to (2,2) and using BIDIRECTIONAL convergence (cols 0..2
go W→E, col 3 goes E→W in parallel; rows 0..2 go N→S, row 3 goes S→N
in parallel) cuts each phase's critical path from 3 hops to 2::
Phase 1 critical path : max(2, 1) = 2 hops
Phase 2 critical path : max(2, 1) = 2 hops
Phase 4 critical path : 2 hops
Phase 5 critical path : 2 hops
Total intra-SIP critical path : 8 hops
Per-hop cost at 96 KB on TCM ≈ 600 ns (slot IO write+read 384 ns +
fabric drain ~217 ns). 4 fewer hops ⇒ ~2.4 µs reduction.
EXPECTED Phase 1 outcome:
- Today (root = corner) : ~22.0 µs ← test FAILS (> 20500 ns)
- After Phase 2 (root = center) : ~19.6 µs ← test PASSES (< 20500 ns)
"""
from __future__ import annotations
from pathlib import Path
import pytest
from kernbench.runtime_api.context import RuntimeContext
from kernbench.runtime_api.types import DeviceSelector
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import resolve_topology
from tests.test_allreduce_multidevice import (
_write_temp_configs,
run_allreduce,
)
def _run_torus_96kb(tmp_path: Path) -> float:
"""Run torus_2d 6-SIP allreduce at 96 KB / slot, return critical-path
pe_exec_ns. Fixed at TCM (the project default)."""
sub = tmp_path / "torus_root_center"
sub.mkdir()
topo_path, ccl_path = _write_temp_configs(
sub,
sip_topology="torus_2d",
n_sips=6,
algorithm="intercube_allreduce",
sip_w=3, sip_h=2,
n_elem_override=49152, # 49152 × 2 = 96 KB / slot
)
topo = resolve_topology(topo_path)
engine = GraphEngine(topo.topology_obj, enable_data=True)
spec = topo.topology_obj.spec
with RuntimeContext(
engine=engine,
target_device=DeviceSelector("all"),
correlation_id="root_center_phase1",
spec=spec,
) as ctx:
result = run_allreduce(
ctx, engine, spec,
algorithm="intercube_allreduce", ccl_yaml=ccl_path,
)
assert result["ok_cubes"] > 0
pe_exec_vals = [
float(tr.get("pe_exec_ns", 0.0) or 0.0)
for _, (_, tr) in engine._results.items()
if isinstance(tr, dict)
]
return max(pe_exec_vals) if pe_exec_vals else 0.0
def test_intra_sip_critical_path_at_96k_below_threshold(tmp_path):
"""Post-Phase-2 (root=center, bidirectional reduce) the torus_2d
96 KB allreduce on TCM should drop below 20.5 µs.
Today's value: ~22.0 µs (12-hop critical path with corner root).
Expected post-Phase-2: ~19.6 µs (8-hop critical path with
center root) — model estimate, ~11% reduction end-to-end.
"""
lat_ns = _run_torus_96kb(tmp_path)
THRESHOLD_NS = 20_500.0
assert lat_ns < THRESHOLD_NS, (
f"torus_2d 6-SIP 96 KB allreduce should land below "
f"{THRESHOLD_NS:.0f} ns post-Phase-2 (root=center, "
f"bidirectional reduce). got {lat_ns:.1f} ns "
f"({lat_ns / 1000:.2f} µs)"
)
def test_correctness_preserved(tmp_path):
"""Smoke check: at small n_elem the new algorithm must still produce
the correct sum across all 96 cubes. ``run_allreduce`` validates
every cube against the expected reduce result (``ok_cubes`` must be
96 = 6 SIPs × 16 cubes).
This guards against the obvious Phase 2 risk: bidirectional reduce
sums each contribution exactly once. If implemented wrong (double-
counting or skipping the right edge column / bottom row), the
asserts inside run_allreduce fail.
"""
sub = tmp_path / "correctness"
sub.mkdir()
topo_path, ccl_path = _write_temp_configs(
sub,
sip_topology="torus_2d",
n_sips=6,
algorithm="intercube_allreduce",
sip_w=3, sip_h=2,
n_elem_override=128, # tiny payload to keep this fast
)
topo = resolve_topology(topo_path)
engine = GraphEngine(topo.topology_obj, enable_data=True)
spec = topo.topology_obj.spec
with RuntimeContext(
engine=engine,
target_device=DeviceSelector("all"),
correlation_id="root_center_correctness",
spec=spec,
) as ctx:
result = run_allreduce(
ctx, engine, spec,
algorithm="intercube_allreduce", ccl_yaml=ccl_path,
)
n_cubes = 6 * 16 # 6 SIPs × 16 cubes/SIP
assert result["ok_cubes"] == n_cubes, (
f"all 96 cubes must validate; got {result['ok_cubes']} OK"
)
+219
View File
@@ -0,0 +1,219 @@
"""Phase 1 micro-tests for IPCQ slot-memory latency model.
These tests assert the TARGET behavior expected after Phase 2 wires
``buffer_kind`` (tcm/sram/hbm) into the IPCQ slot read/write latency
charges. They are written BEFORE the production change and are
EXPECTED TO FAIL today.
Failure semantics today:
- Slot access is latency-free, so the tcm/sram/hbm runs produce
identical pe_exec_ns. The ordering assertion therefore fails with
"tcm == sram == hbm" — proving the test harness is wired and that
Phase 2 production work is what makes them pass.
Reference (Phase 2 will edit these):
- src/kernbench/components/builtin/pe_dma.py — _handle_ipcq_inbound
- src/kernbench/components/builtin/pe_ipcq.py — _handle_recv,
_BUFFER_KIND_BW table
- src/kernbench/runtime_api/kernel.py — IpcqDmaToken adds
buffer_kind field
- ccl.yaml — algorithm.buffer_kind
The tests reuse the existing config-driven allreduce app
(``run_allreduce`` in tests/test_allreduce_multidevice.py) with a 2-SIP
ring topology and a SMALL n_elem so they finish fast (~3-5 s each).
"""
from __future__ import annotations
from pathlib import Path
from typing import Any
import pytest
from kernbench.runtime_api.context import RuntimeContext
from kernbench.runtime_api.types import DeviceSelector
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import resolve_topology
# Reuse the test app's helpers so this micro-test file does not
# duplicate the run-allreduce + write-temp-configs plumbing.
from tests.test_allreduce_multidevice import (
_write_temp_configs,
run_allreduce,
)
# Expected per-tier BW + overhead (Phase 2 will encode this in
# pe_ipcq.py). Mirrors topology.yaml component values.
_EXPECTED_BW = {
"tcm": (512.0, 0.0),
"sram": (128.0, 2.0),
"hbm": (32.0, 6.0),
}
def _expected_slot_io_ns(buffer_kind: str, nbytes: int) -> float:
"""Per-access latency the model is expected to add (write OR read)."""
bw_gbs, overhead_ns = _EXPECTED_BW[buffer_kind]
# 1 GB/s = 1 byte/ns
return nbytes / bw_gbs + overhead_ns
def _run_torus_allreduce(
tmp_path: Path, *, buffer_kind: str, n_elem: int,
) -> float:
"""Run one torus_2d 6-SIP allreduce and return critical-path
pe_exec_ns. The buffer_kind override is wired into ccl.yaml.
"""
sub = tmp_path / f"{buffer_kind}_{n_elem}"
sub.mkdir()
topo_path, ccl_path = _write_temp_configs(
sub,
sip_topology="torus_2d",
n_sips=6,
algorithm="intercube_allreduce",
sip_w=3, sip_h=2,
n_elem_override=n_elem,
)
# Patch ccl.yaml in-place so the algorithm picks up buffer_kind.
import yaml
with open(ccl_path) as f:
ccl_cfg = yaml.safe_load(f)
ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind
ccl_cfg.setdefault("algorithms", {}).setdefault(
"intercube_allreduce", {},
)["buffer_kind"] = buffer_kind
with open(ccl_path, "w") as f:
yaml.dump(ccl_cfg, f, default_flow_style=False)
topo = resolve_topology(topo_path)
engine = GraphEngine(topo.topology_obj, enable_data=True)
spec = topo.topology_obj.spec
with RuntimeContext(
engine=engine,
target_device=DeviceSelector("all"),
correlation_id=f"bk_{buffer_kind}_{n_elem}",
spec=spec,
) as ctx:
result = run_allreduce(
ctx, engine, spec,
algorithm="intercube_allreduce", ccl_yaml=ccl_path,
)
assert result["ok_cubes"] > 0, "allreduce did not validate"
pe_exec_vals = [
float(tr.get("pe_exec_ns", 0.0) or 0.0)
for _, (_, tr) in engine._results.items()
if isinstance(tr, dict)
]
return max(pe_exec_vals) if pe_exec_vals else 0.0
# ── Phase 1 assertions ───────────────────────────────────────────────
def test_slot_write_latency_orders_tcm_sram_hbm(tmp_path):
"""tcm < sram < hbm at 8192 B per send.
Pre-Phase-2: all three return the same pe_exec_ns and this
assertion fails. Post-Phase-2: the per-tier BW + overhead make
hbm visibly slower than sram, which is slower than tcm.
"""
n_elem = 4096 # 8192 B per slot
lat_tcm = _run_torus_allreduce(tmp_path, buffer_kind="tcm", n_elem=n_elem)
lat_sram = _run_torus_allreduce(tmp_path, buffer_kind="sram", n_elem=n_elem)
lat_hbm = _run_torus_allreduce(tmp_path, buffer_kind="hbm", n_elem=n_elem)
# Expected per-access deltas (write+read = 2× the per-access value).
exp_tcm = 2 * _expected_slot_io_ns("tcm", n_elem * 2)
exp_sram = 2 * _expected_slot_io_ns("sram", n_elem * 2)
exp_hbm = 2 * _expected_slot_io_ns("hbm", n_elem * 2)
# Floor margin: 50% of the raw expected per-access delta — lets Phase 2
# implementation choose to charge only one side without breaking the test,
# but still requires a clearly observable gap.
margin_sram_tcm = 0.5 * (exp_sram - exp_tcm)
margin_hbm_sram = 0.5 * (exp_hbm - exp_sram)
assert lat_sram > lat_tcm + margin_sram_tcm, (
f"sram should be slower than tcm by ≥ {margin_sram_tcm:.1f} ns "
f"per allreduce, got sram={lat_sram:.1f} tcm={lat_tcm:.1f} "
f"(delta={lat_sram - lat_tcm:.1f})"
)
assert lat_hbm > lat_sram + margin_hbm_sram, (
f"hbm should be slower than sram by ≥ {margin_hbm_sram:.1f} ns "
f"per allreduce, got hbm={lat_hbm:.1f} sram={lat_sram:.1f} "
f"(delta={lat_hbm - lat_sram:.1f})"
)
def test_slot_io_scales_linearly_with_nbytes(tmp_path):
"""For buffer_kind=hbm, doubling nbytes should add ~nbytes/32 ns
of latency to each slot access. Sanity-checks the slope.
Pre-Phase-2: latency does not respond to nbytes via memory BW
(only via fabric drain), so the observed slope is dominated by
fabric BW and does NOT match 1/32 ns/B.
"""
lat_4k = _run_torus_allreduce(tmp_path, buffer_kind="hbm", n_elem=2048)
lat_8k = _run_torus_allreduce(tmp_path, buffer_kind="hbm", n_elem=4096)
# Expected delta from doubling: at least one slot-IO event per cube
# in the critical path (very conservative). Per-access add = 4096/32 ≈ 128
# ns on HBM going from 4k → 8k. Multiple slot accesses on the critical
# path should make the observed delta meaningfully larger.
expected_min_delta = 0.5 * (4096 / 32.0) # ≈ 64 ns
assert lat_8k - lat_4k > expected_min_delta, (
f"doubling nbytes on hbm should add ≥ {expected_min_delta:.1f} ns "
f"of slot-IO latency, got delta={lat_8k - lat_4k:.1f} ns "
f"(lat_4k={lat_4k:.1f}, lat_8k={lat_8k:.1f})"
)
def test_buffer_kind_sensitivity_grows_with_payload(tmp_path):
"""Credit-return cost is fabric-only by design (16 B packet); only
the data slot-IO charge depends on ``buffer_kind``. Therefore the
tcm-vs-hbm gap must scale with payload size and be a small fraction
of the large-payload gap at small payloads.
Concrete invariant the model must satisfy:
gap_small / gap_large < 0.10
Pre-Phase-2: gap_small == gap_large == 0 (division undefined → test
fails because gap_large is required > 0). Post-Phase-2: at small
nbytes the slot-IO charge is dominated by the constant
``overhead_ns`` term, while at large nbytes it is dominated by the
``nbytes / bw_gbs`` term — so gap_large grows linearly while
gap_small stays small.
"""
n_elem_small = 8 # 16 B per slot — overhead-bound
n_elem_large = 16384 # 32 KB per slot — bandwidth-bound
lat_tcm_small = _run_torus_allreduce(
tmp_path, buffer_kind="tcm", n_elem=n_elem_small,
)
lat_hbm_small = _run_torus_allreduce(
tmp_path, buffer_kind="hbm", n_elem=n_elem_small,
)
lat_tcm_large = _run_torus_allreduce(
tmp_path, buffer_kind="tcm", n_elem=n_elem_large,
)
lat_hbm_large = _run_torus_allreduce(
tmp_path, buffer_kind="hbm", n_elem=n_elem_large,
)
gap_small = abs(lat_hbm_small - lat_tcm_small)
gap_large = abs(lat_hbm_large - lat_tcm_large)
assert gap_large > 1000.0, (
f"large-payload buffer_kind gap must be observably large "
f"(this is the sweep's whole point). got gap_large={gap_large:.1f} ns "
f"(lat_tcm_large={lat_tcm_large:.1f}, lat_hbm_large={lat_hbm_large:.1f})"
)
assert gap_small / gap_large < 0.10, (
f"buffer_kind sensitivity should grow with payload — "
f"small-payload gap should be < 10% of large-payload gap. "
f"got gap_small={gap_small:.1f} ns, gap_large={gap_large:.1f} ns, "
f"ratio={gap_small / gap_large:.3f}"
)