12 Commits

Author SHA1 Message Date
ywkang fca24feac5 Fix all remaining test failures: single-cube allreduce + matplotlib dep
- intercube_allreduce: add single-cube fast path that skips intra-SIP
  mesh reduce and goes directly to inter-SIP exchange. Fixes IPCQ
  deadlock when TP launches kernel on one cube per SIP.
- distributed.py: derive effective cube dims from tensor shard placement
  instead of hardcoding topology mesh size.
- pyproject.toml: add matplotlib>=3.7 to dependencies.
- pe_dma.py (prior commit): add MMU translation in pipeline DMA path.

577 passed, 0 failed (was 529 passed, 10 failed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-27 21:25:31 -07:00
ywkang d55dc6cb4f Merge: accept remote pe2pe summary.csv 2026-04-27 17:13:06 -07:00
mukesh 46291bf91b PE-to-PE latency: drop h5 inter-SIP panel from overview
Remove h5_inter_sip from the hop list and switch the overview grid
from 2x3 to 2x2. RAW DMA was unavailable for the cross-SIP hop, so
the panel only carried IPCQ data and was redundant with h4_inter_cube
for the topology comparison.

Regenerate pe2pe_latency_plots/overview.png and summary.csv; delete
the obsolete h5_inter_sip.png.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:43:28 -07:00
mukesh 04c912f53e Allreduce sweep: parametrized + xdist parallelism + topology diagram
Refactor the latency sweep from one giant test into 36 parametrized
cases that run in parallel under xdist (~6-8x faster: 1:49 instead of
~10 min). Each case writes a JSON row to a staging dir; conftest
sessionfinish hook aggregates rows on the controller node into
summary.csv and the per-topology + overview plots.

Aggregator gains a CSV fallback so plot-only tweaks no longer require
re-running the sweep.

Overview plot updates:
- 96 KB explicit x-axis marker with vertical dotted line
- horizontal theoretical 2D-torus reference (10600 ns)
- annotation showing both theoretical and simulated values at 96 KB
- drop overlapping 128 KB tick

New topology.png: 2x2 panel diagram showing device-level topology
(ring, torus 2x3, mesh 2x3) and the cube-level reduction inside SIP 0.
Wrap arrows anchor on box edges and arc outside rows/columns so they
do not overlap any SIP.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:43:19 -07:00
mukesh 1c33afec55 ADR-0032 + intra_* opposite directions in IPCQ install
Add intra_N/S/E/W to install.py _OPPOSITE_DIR table so the intra-cube
PE-to-PE namespace is symmetrical with intercube N/S/E/W. ADR-0032
documents the intercube allreduce algorithm (supersedes ADR-0029).
Refresh ADR-0024/0025/0029 cross-refs and update
test_intercube_sfr_config.py to cover the new intra_* mappings. Drop
the obsolete test_ccl_round_robin_recv.py (replaced by intercube tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:43:01 -07:00
ywkang 81cc32c46b ADR-0001 Rev 2: 51-bit PhysAddr layout with concrete sub-unit tables
Remove rack_id (4 bits), rename sip_seg→die_id, shift fields to enable
42-bit local_offset (4 TB per die). Define PE_LOCAL/MCPU_LOCAL/CUBE_SRAM
sub-unit tables for AHBM dies and IOCPU sub-unit table for IOCHIPLET
dies (1 TB window). Supersedes ADR-0031.

Also fixes latent VA/PA confusion in pe_dma pipeline DMA path where
virtual addresses were decoded as physical addresses without MMU
translation — previously masked by coincidental bit-position alignment.

529 passed (+6 recovered), 10 pre-existing failures unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-27 15:52:29 -07:00
mukesh e9cc40f74d Rectangular SIP topology + 6-device allreduce sweep
mesh_2d, torus_2d, and mesh_2d_no_wrap accept optional w,h kwargs;
sqrt fall-back preserved for square layouts (back-compat tests
confirm 4-SIP and 9-SIP square configs still work). sfr_config
reads system.sips.w/h from spec and threads dims through to the
topology fn.

test_allreduce_multidevice CONFIGS switched from 4 SIPs (square)
to 6 SIPs: ring_1d_6sip, torus_2d_6sip_2x3, mesh_2d_no_wrap_6sip_2x3.
_write_temp_configs writes system.sips.w/h when supplied;
_sip_topo_dims reads them back. Latency sweep loop also moved to
6-SIP layouts. Linear-scale plot variants dropped -- only log-scale
*.png + summary.csv emitted. Plots in tests/allreduce_latency_plots
regenerated.

New tests/test_sip_topology_rectangular.py asserts neighbor
correctness for 2x3 layouts and back-compat for square fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:13:14 -07:00
mukesh c1a5cf3a2a ADR-0009 D5: chain-aware target_start_ns + zero-byte launch fanout
The single-walk predictor (find_node_path(io_cpu, pe_cpu) +
compute_path_latency_ns) under-shot actual dispatch latency for far
cubes -- the routing graph could pick a path bypassing M_CPU, and
non-zero-nbytes launch sub-txns serialized on shared first hops.
Far PEs arrived at _execute_kernel after target_start_ns, silently
skipped the barrier yield, and started pe_exec_start late. Their
reported pe_exec_ns under-counted by exactly the late_ns amount
(63 ns observed at h4 cube4.pe0 in the IPCQ test, up to 113 ns
worst case for cubes 9-11), producing the suspicious flat region
in the h4 IPCQ curve at 8192/10240 bytes.

Fix:
  - IO_CPU predictor uses the explicit two-leg chain
    (IO_CPU->M_CPU + M_CPU->PE_CPU - io.overhead - m.overhead), so
    every PE on every targeted cube has a barrier >= its real
    dispatch arrival.
  - Kernel-launch fanout sub-txns carry nbytes=0 (control-plane,
    not data-plane), removing the per-cube fanout serialization
    that pushed far M_CPUs past the predictor.
  - Legacy io_cpu mirror updated.

ADR-0009 D5 mechanism updated to specify the two-leg formula and
the nbytes=0 requirement. New tests/test_d5_barrier_invariant.py
asserts (a) no PE enters _execute_kernel after target_start_ns and
(b) every PE in a multi-cube launch has identical pe_exec_start --
both regressions silently pass on the existing
tests/test_kernel_launch_sync.py because that test only inspects
post-aggregation max(pe_exec_ns).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:12:58 -07:00
mukesh 90874abbfe ADR-0023 D9: blocking credit-emit with full-path latency
PE_IPCQ._handle_recv now yields-from _delayed_credit_send instead of
spawning it as a fork, so the receiver's pe_exec_ns includes the
credit-return cost. _credit_latency_ns switches from
compute_drain_ns(path, 16) to compute_path_latency_ns(path, 16) and
fixes a latent find_path bug where the destination lacked the
".pe_dma" suffix (silently returned 0 ns under the bare except).

Net effect on h3/h4 inter-cube pe-to-pe latency: IPCQ >= raw DMA at
every size, matching real-HW posted-write semantics. tl.send remains
fire-and-forget. ADR-0023 D9 amended; new diagnostic test
tests/test_pe_to_pe_diagnostic.py captures per-PE pe_exec_ns, paths,
drain, and meta-arrival timing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:12:38 -07:00
mukesh 19dfc86dc3 Allreduce latency sweep across topologies and data sizes
Adds test_allreduce_latency_sweep that runs the existing intercube
allreduce kernel under three SIP topologies (ring_1d, torus_2d,
mesh_2d_no_wrap, all at n_sips=4) across 11 data sizes from 256 B/SIP
up to 1 MB/SIP. For each point, captures max(pe_exec_ns) — the
critical-path kernel time — and emits CSV plus log-x and linear-x
plots, both per-topology and combined overview, with KB/MB-formatted
tick labels. Reuses run_allreduce + _write_temp_configs and adds a
slot_size auto-bump when n_elem*2 exceeds the default IPCQ slot.

Sweep skips n_elem=16 because the runtime's dim_map scalar-arg
remapping (context.py:761) collides any int-valued kernel scalar that
matches a global tensor dim with its local shard size.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 10:16:29 -07:00
mukesh 14d800b0ae Kernel-launch sync (ADR-0009 D5) and IPCQ drain at inbound (ADR-0023)
- KernelLaunchMsg gains target_start_ns: IO_CPU stamps a global barrier
  (max path latency across every target PE), M_CPU passes it through,
  PE_CPU yields until it before recording pe_exec_start. Every PE in a
  launch begins kernel execution at the same env.now regardless of its
  dispatch path length — eliminates per-PE dispatch-offset artifact in
  cross-PE and cross-cube latency measurements.

- PE_DMA._handle_ipcq_inbound now pays Transaction.drain_ns at the top,
  matching the terminal-drain behavior of ComponentBase._forward_txn for
  every non-IPCQ Transaction. SRC-side tl.send stays fire-and-forget
  (sender doesn't yield on sub_done); tl.recv now blocks until bytes
  have actually drained into its inbox.

- ComponentContext: new compute_path_latency_ns helper + node_overhead_ns
  field populated by GraphEngine.

- tests/test_kernel_launch_sync.py: asserts all PEs in one launch
  produce identical pe_exec_ns for a no-op kernel (zero spread).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 15:30:29 -07:00
mukesh 6918e6e906 PE-to-PE latency test + supporting fixes
Adds tests/test_pe_to_pe_latency.py: a sweep that measures PE-to-PE
transfer latency for five hop types (intra-cube horizontal/vertical,
inter-cube horizontal/vertical, inter-SIP) across data sizes 128 B to
10 KB, on both the IPCQ (tl.send/tl.recv) and raw-DMA (tl.load+tl.store)
paths. Emits per-hop PNG plots, an overview PNG, and a CSV summary into
tests/pe2pe_latency_plots/. Latency is reported as max(pe_exec_ns) across
participating PEs, read from engine.get_completion(), so the measurement
captures the SRC/DST PE's kernel body time rather than the full launch+
response-aggregation envelope.

Two simulator fixes were needed to make this measurement meaningful:

- PeMMU now stores a list of (start, end, pa) sub-regions per page
  rather than a single PA. DPPolicy layouts with shards smaller than
  page_size (e.g. 128 B payloads with 4 KB pages) used to silently
  overwrite each other through last-write-wins, causing DMAs intended
  for cube0 to physically route to cube3 - inflating latency by ~170 ns
  per DMA at small sizes. STOPGAP: real MMUs don't support sub-page
  regions; long-term fix is either smaller MMU page size or DPPolicy
  validation that refuses sub-page shards.

- M_CPU's per-PE metrics aggregation (pe_exec_ns, dma_ns, compute_ns)
  now max-merges against the existing value in result_data rather than
  overwriting. Multi-cube workloads share one result_data dict via
  IO_CPU fanout; the previous overwrite caused whichever cube's M_CPU
  finished last to clobber others' values, so multi-cube pe_exec_ns was
  racy and frequently 0. Same fix applied in legacy/builtin/m_cpu.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 21:04:31 -07:00
70 changed files with 4209 additions and 584 deletions
+303 -49
View File
@@ -1,25 +1,39 @@
# ADR-0001: PhysAddr Layout & Address Decoding Contract
# ADR-0001: 51-bit Physical Address Layout & Decoding Contract
## Status
Accepted
Accepted (Revision 2 — 2026-04-27: concrete bit layout, rack_id removal,
Tray->SIP / SIP->DIE renaming, PE/MCPU/IOCPU sub-unit tables.
Supersedes ADR-0031.)
## Date
2026-02-27
2026-04-27 (original: 2026-02-27)
## Context
KernBench Graph Latency Simulator must route requests deterministically and compute end-to-end latency strictly by graph traversal.
To model local vs remote traffic (same/different SIP, same/different CUBE, optional PE-group), requests need a stable, parsable address/location scheme that:
KernBench requires a stable, parsable physical address scheme that:
- can be decoded into routing domains (SIP/CUBE/HBM/PE-resource, etc.)
- can be decoded into routing domains (SIP / die / HBM / PE-resource / IOCPU)
- remains topology-agnostic (no hardcoded counts)
- supports swappable policy and DI-first components without leaking topology assumptions into node implementations
- supports swappable policy and DI-first components
- covers multiple SIPs, AHBM dies, and IO chiplet dies in a unified space
### History
- Original ADR-0001 defined a 51-bit layout with `rack_id(4) + sip_id(4) +
sip_seg(5) + local_offset(38)`. `rack_id` was never used in practice.
- ADR-0031 (stub) requested PE-resource range partition but was never
implemented.
Revision 2 removes `rack_id`, renames `sip_seg -> die_id`, and provides
concrete sub-unit tables for PE, MCPU, CUBE_SRAM, and IOCPU resources.
ADR-0031 is superseded.
## Decision
We define a **PhysAddr value object** and an **address decoding contract** that converts an integer address into routing domains.
We define a **PhysAddr value object** and an **address decoding contract**
that converts an integer address into routing domains.
### D1. PhysAddr is an immutable value object
@@ -27,82 +41,322 @@ We define a **PhysAddr value object** and an **address decoding contract** that
- Any allocator returns a **fully specified PhysAddr** (not partial metadata).
- No global state may be required to interpret a PhysAddr.
### D2. PhysAddr fields (logical contract)
### D2. 51-bit Physical Address Layout
PhysAddr must be able to represent at least:
A 51-bit physical address is adopted.
- `rack_id` (optional but reserved for scale-out)
- `sip_id` (device / SIP domain)
- `sip_seg` (SIP-level segment/window selection, e.g., cube window)
- `local_offset` (offset within the chosen segment/window)
#### 2.1 Top-Level Address Map
Decoded/derived fields may include (optional):
```text
[50:47] sip_id (4) -- 16 SIPs
[46:42] die_id (5) -- 32 dies per SIP
[41: 0] local_offset (42) -- 4 TB per die
```
- `cube_id`
- `kind` (e.g., HBM vs PE-resource vs raw)
- `unit_type` / `pe_id` (if PE-level addressing is modeled)
```text
50 47 46 42 41 0
+---------+----------+-------------------------+
| sip_id | die_id | local_offset |
+---------+----------+-------------------------+
```
**Important:** The exact bit allocation may evolve, but the *semantic fields above* must remain decodable without hidden assumptions.
#### 2.2 die_id Allocation
### D3. Decoding is deterministic and policy-compatible
| die_id | Meaning |
|--------|---------|
| 0..15 | AHBM dies |
| 16..20 | IOCHIPLET dies |
| 21..31 | Reserved |
- Decoding must deterministically map an integer address to:
- destination SIP domain (`sip_id`)
- destination sub-domain (`cube_id` if applicable)
- destination target kind (HBM/PE-resource/other)
- Decoding must not depend on runtime topology sizes; it may depend on **explicit topology parameters** provided through configuration (e.g., segment size, slice size), and those parameters must live in the topology/config layer (not in random components).
#### 2.3 AHBM Die Layout
### D4. Topology-derived constants live in the topology layer
Only lower 256 GB of the 4 TB die-local window is assigned.
Constants such as segment sizes (e.g., HBM slice size / window size) are derived from topology configuration (YAML/JSON/dict) and are provided to the decoder via DI/config.
They must not be hardcoded in node implementations.
```text
[41:38] MBZ (4)
[37] addr_space (1) -- 0 = local resource, 1 = HBM memory
[36: 0] sub-address (37)
```
| addr_space | Meaning |
|------------|---------|
| 0 | Local resource |
| 1 | HBM memory |
##### 2.3.1 HBM Window (addr_space = 1)
```text
[36:0] hbm_offset (37) -- 128 GB decode window
```
The architectural decode window is fixed at 128 GB. Implemented capacity
may be smaller depending on SKU/topology (see D4).
##### 2.3.2 Resource Window (addr_space = 0)
```text
[36:34] resource_kind (3)
[33: 0] kind_local (34) -- 16 GB per kind
```
| resource_kind | Meaning |
|---------------|---------|
| 000 | PE_LOCAL |
| 001 | MCPU_LOCAL |
| 010 | CUBE_SRAM |
| 011..111 | Reserved |
Each kind gets a 16 GB decode region.
##### 2.3.3 PE_LOCAL (resource_kind = 000)
```text
[33] MBZ (1)
[32:29] pe_id (4) -- 0..15
[28:25] pe_sub_unit (4)
[24: 0] sub_offset (25) -- 32 MB per slot
```
16 PEs x 16 sub-unit slots x 32 MB = 8 GB active decode.
| pe_sub_unit | Name | Budget |
|-------------|------|--------|
| 0 | PE_CPU_DTCM | 8 KB |
| 1 | MATH_ENGINE_DTCM | 8 KB |
| 2 | IPCQ | 256 KB |
| 3 | PE_CPU_SFR | 16 KB |
| 4 | MATH_ENGINE_SFR | 16 KB |
| 5 | DMA_ENGINE_SFR | 192 KB |
| 6 | PE_TCM | 2 MB |
| 7..15 | Reserved | -- |
##### 2.3.4 MCPU_LOCAL (resource_kind = 001)
```text
[33:30] MBZ (4)
[29:25] mcpu_sub_unit (5)
[24: 0] sub_offset (25) -- 32 MB per slot
```
1 GB active decode.
| mcpu_sub_unit | Name | Budget |
|---------------|------|--------|
| 0 | MCPU_ITCM | 512 KB |
| 1 | MCPU_DTCM | 512 KB |
| 2 | IPCQ | 256 KB |
| 3 | MCPU_SFR | 8 KB |
| 4 | MCPU_DMA_SFR | 16 KB |
| 5 | MCPU_SRAM | 10 MB |
| 6..31 | Reserved | -- |
##### 2.3.5 CUBE_SRAM (resource_kind = 010)
```text
[33:25] MBZ (9)
[24: 0] sram_offset (25) -- flat 32 MB
```
#### 2.4 IOCHIPLET Die Layout
Only lower 1 TB of the 4 TB die-local window is assigned.
```text
[41:40] MBZ (2)
[39: 0] chiplet_offset (40) -- 1 TB
```
Region split by address range:
| Range | Meaning | Decode condition |
|-------|---------|------------------|
| [0, 2 GB) | IOCPU resource | chiplet_offset < 0x8000_0000 |
| [2 GB, 1 TB) | UAL | chiplet_offset >= 0x8000_0000 |
##### 2.4.1 IOCPU Region
```text
[30:27] iocpu_sub_unit (4)
[26: 0] sub_offset (27) -- 128 MB per slot
```
16 x 128 MB slots. 2 GB active decode.
| iocpu_sub_unit | Name | Budget |
|----------------|------|--------|
| 0 | IOCPU_ITCM | 512 KB |
| 1 | IOCPU_DTCM | 512 KB |
| 2 | IPCQ | 2 MB |
| 3 | IOCPU_SFR | 8 KB |
| 4 | IO_DMA_SFR | 16 KB |
| 5 | IO_SRAM | 64 MB |
| 6..15 | Reserved | -- |
##### 2.4.2 UAL Region
Sub-layout TBD (separate ADR).
#### 2.5 Addressing Rules
1. MBZ bits must be zero. An address with non-zero MBZ bits is
**architecturally invalid**. Implementation may raise a decode fault
or return an error -- behavior is not prescribed by this ADR.
2. Fixed slot sizes are chosen for simple hardware decode; actual
implemented capacity may be smaller than the slot.
3. Access beyond a sub-unit's implemented budget within a slot is
**architecturally invalid** (same policy as MBZ).
### D3. Bitfield decoding is deterministic
Given an integer address, field extraction (`sip_id`, `die_id`, `kind`,
`sub_unit`, `offset`) is purely positional. No runtime state is required.
Decoding deterministically maps an integer address to destination domains:
`sip_id`, `die_id`, target kind (HBM / PE_LOCAL / MCPU_LOCAL / CUBE_SRAM /
IOCPU / UAL).
### D4. Capacity validation may depend on topology config
Whether a decoded address falls within **implemented capacity** (e.g.,
HBM 96 GB on a specific SKU) is checked against topology parameters
provided via DI/config. Decode itself (D3) never consults topology --
only validation does. These parameters must live in the topology/config
layer, not in node implementations.
### D5. Routing consumes decoded domains, not raw bits
Routing policy uses decoded domains:
- `src` location (sip/cube/pe or node_id)
- `src` location (sip / die / pe or node_id)
- `dst` domains derived from PhysAddr decoding
- `size_bytes` for size-aware link latency
Routing must not inspect raw bit-fields directly except inside the decoding module.
Routing must not inspect raw bit-fields directly except inside the
decoding module.
## Alternatives Considered
1) **Use raw integers everywhere, decode ad-hoc in routing**
1. **Keep `rack_id` (4 bits)**: Rejected -- never used in practice,
consumes 4 bits that enable die-local expansion to 42 bits
(IOCHIPLET 1 TB).
- Rejected: leads to duplicated logic, inconsistent routing, and hidden assumptions embedded in multiple components.
2. **Uniform 256 GB per die**: Rejected -- IOCHIPLET UAL requires ~1 TB.
Freed rack_id bits enable 42-bit local_offset.
1) **Hardcode topology sizes (SIP/CUBE/PE counts) into decoding**
3. **Variable-width die windows (AHBM 256 GB, CHIPLET 1 TB via multi-seg
spanning)**: Rejected -- complicates D3 (deterministic decoding).
Uniform 4 TB window with MBZ padding is simpler.
- Rejected: violates SPEC (R3) and breaks swappability and configuration-driven topologies.
4. **Use raw integers everywhere, decode ad-hoc in routing**: Rejected --
leads to duplicated logic, inconsistent routing, and hidden
assumptions.
1) **Put decoding inside memory controllers or routers**
5. **Hardcode topology sizes (SIP/CUBE/PE counts) into decoding**:
Rejected -- violates SPEC R3 and breaks swappability.
- Rejected: leaks policy into components and undermines DI-first, swappable implementations (SPEC R4).
6. **Put decoding inside memory controllers or routers**: Rejected --
leaks policy into components, violates SPEC R4 / D5.
## Consequences
### Positive
- Deterministic routing domains enable clear test invariants for local vs remote paths (SPEC R1, R5).
- Keeps topology variability (SPEC R3) while preserving consistent semantics.
- DI-first: decoder can be swapped or extended without changing components or tests (SPEC R4).
- Simple hierarchical decoder: SIP -> die -> kind -> sub-unit.
- Clean separation of memory (HBM) vs local resource (PE/MCPU/SRAM/IOCPU).
- Deterministic routing domains enable clear test invariants (SPEC R1, R5).
- Expandable: 11 reserved die_id slots, reserved resource_kind / sub-unit
slots, reserved MBZ bits.
- DI-first: decoder can be swapped without changing components (SPEC R4).
### Tradeoffs / Costs
### Tradeoffs
- Requires explicit configuration for any topology-derived sizes.
- Introduces a single “blessed” decoding module that must remain stable and well-tested.
- Sparse address holes due to power-of-2 slot alignment.
- Large reserved/MBZ regions (intentional for future extension).
- Requires explicit configuration for topology-derived sizes (D4).
- Introduces a single "blessed" decoding module that must remain stable
and well-tested.
## Supersedes
- **ADR-0031 (PhysAddr PE-Resource Extension)**: stub status. The
PE_LOCAL / MCPU_LOCAL / CUBE_SRAM sub-unit tables in D2.3.3-D2.3.5
fulfill ADR-0031's stated goals.
## Implementation Notes (Non-normative)
- Recommended module boundary:
- `src/kernbench/policy/address/phyaddr.py`
- Recommended module: `src/kernbench/policy/address/phyaddr.py`
- Tests should cover: encode/decode round-trip per kind, MBZ enforcement,
die_id dispatch (AHBM / IOCHIPLET / reserved), sub-unit boundary
values, backward compatibility of factory APIs.
- Factory methods: `hbm_addr`, `pe_hbm_addr`, `pe_tcm_addr`,
`cube_sram_addr` retain signatures (minus `rack_id`); `cube_id`
parameter renamed to `die_id`.
- New factories: `pe_resource_addr`, `mcpu_resource_addr`,
`iocpu_resource_addr`, `ual_addr`.
- Tests should cover:
- deterministic decoding
- local vs remote classification from decoded fields
- invariants: “allocator returns full PhysAddr”, “decoding requires no global state”
## Appendix A. Address Examples
### A.1 AHBM HBM access
sip=2, die=5, HBM offset=0x1000
```text
sip_id = 2 -> [50:47] = 0b0010
die_id = 5 -> [46:42] = 0b00101
addr_space = 1 -> [37] = 1 (HBM)
hbm_offset = 0x1000 -> [36:0]
51-bit addr = (2 << 47) | (5 << 42) | (1 << 37) | 0x1000
```
### A.2 AHBM PE_LOCAL -- PE3 PE_TCM, offset=0x400
```text
sip_id = 0 -> [50:47] = 0
die_id = 0 -> [46:42] = 0
addr_space = 0 -> [37] = 0
resource_kind = 0 -> [36:34] = 000 (PE_LOCAL)
pe_id = 3 -> [32:29] = 0011
pe_sub_unit = 6 -> [28:25] = 0110 (PE_TCM)
sub_offset = 0x400 -> [24:0]
local_offset = (0 << 34) | (3 << 29) | (6 << 25) | 0x400
```
### A.3 AHBM MCPU_LOCAL -- MCPU_SRAM, offset=0x0
```text
sip_id = 1 -> [50:47] = 0001
die_id = 3 -> [46:42] = 00011
addr_space = 0 -> [37] = 0
resource_kind = 1 -> [36:34] = 001 (MCPU_LOCAL)
mcpu_sub_unit = 5 -> [29:25] = 00101 (MCPU_SRAM)
sub_offset = 0 -> [24:0] = 0
local_offset = (1 << 34) | (5 << 25)
```
### A.4 IOCHIPLET -- IOCPU IPCQ, offset=0x20000
```text
sip_id = 1 -> [50:47] = 0001
die_id = 17 -> [46:42] = 10001 (IOCHIPLET[1])
iocpu_sub_unit = 2 -> [30:27] = 0010 (IPCQ)
sub_offset = 0x20000 -> [26:0]
chiplet_offset = (2 << 27) | 0x20000
(< 0x8000_0000 -> IOCPU region)
```
### A.5 IOCHIPLET -- UAL region, offset=4 GB
```text
sip_id = 0 -> [50:47] = 0
die_id = 16 -> [46:42] = 10000 (IOCHIPLET[0])
chiplet_offset = 0x1_0000_0000 (4 GB >= 2 GB -> UAL region)
```
## Links
- SPEC.md: R1 (routing), R3 (configurable topology), R4 (DI-first), R5 (multi-domain comm)
- SPEC.md: R1 (routing), R3 (configurable topology), R4 (DI-first),
R5 (multi-domain comm)
- ADR-0031: Superseded
@@ -67,6 +67,76 @@ Completion semantics:
---
### D5. Launch timing is endpoint-synchronized
All PEs targeted by a single kernel launch MUST begin executing the kernel
body at the same simulated time, regardless of their dispatch path length
from the launch entry point.
Rationale. The dispatch tree Host → IO_CPU → M_CPU → PE_CPU has variable
latency at every level. PEs near their M_CPU receive the launch earlier
than PEs farther away; cubes near an IO_CPU receive it earlier than cubes
farther away. Without synchronization, each PE's kernel begins at a
different `env.now`, making per-PE metrics such as `pe_exec_ns` a function
of dispatch-path geometry rather than of the kernel's behavior —
producing measurement artifacts in benchmarks that time kernel-internal
waits (for example `tl.recv` on cross-cube or cross-SIP hops).
Mechanism.
- `KernelLaunchMsg` carries an optional `target_start_ns: float | None`.
- **IO_CPU** is the canonical stamper. On fan-out to M_CPUs, it
computes `target_start_ns = env.now + max_latency` where
`max_latency` is the maximum, over every target (sip, cube, pe)
tuple, of the **two-leg dispatch chain**:
```
max_latency(sip, cube, pe) =
compute_path_latency_ns(find_node_path(io_cpu, m_cpu(sip, cube)))
+ compute_path_latency_ns(find_node_path(m_cpu(sip, cube), pe_cpu))
- io_cpu.overhead_ns
- m_cpu.overhead_ns
```
This models the actual dispatch as **two sequential Transactions**
(IO_CPU → M_CPU, then M_CPU → PE_CPU). Each leg's
`compute_path_latency_ns` adds its endpoints' `overhead_ns`;
`io_cpu.overhead_ns` is subtracted because IO_CPU has already
paid it before this method runs, and `m_cpu.overhead_ns` is
subtracted once because it appears as endpoint of leg1 *and*
start of leg2 but is paid only once at run time. A single
`find_node_path(io_cpu, pe_cpu)` walk is **not** equivalent —
it can pick a graph path that bypasses M_CPU and silently
under-shoots the prediction for far cubes, breaking the D5
invariant.
The fanned-out sub-Transactions carry **`nbytes = 0`** for
`KernelLaunchMsg` (control message only). Without this,
large kernel-launch payloads would occupy fabric BW on the
shared first hop and serialize the per-cube dispatch, pushing
far M_CPUs past `target_start_ns` and re-introducing the
late-arrival violation.
- **M_CPU** passes an already-stamped `target_start_ns` through
unchanged. Only when the value is absent (e.g. a direct
launch-to-M_CPU unit test) does M_CPU compute a per-cube barrier
`env.now + max(local command-path latency)`.
- **PE_CPU** yields `env.timeout(target_start_ns - env.now)` at the top
of `_execute_kernel`, before recording `pe_exec_start` and invoking
the kernel body.
- When `target_start_ns is None`, PE_CPU falls through to the legacy
unsynchronized behavior — preserving backward compatibility.
IO_CPU-level stamping guarantees every PE across every targeted cube
uses the same barrier sim-time, eliminating both the within-cube
dispatch-offset artifact *and* the cross-cube offset artifact in
multi-cube launches. Models a real-hardware timed-broadcast launch
(latency-equalized dispatch tree).
The synchronization is internal to the engine / IO_CPU / M_CPU / PE_CPU
control plane — runtime API and application kernels are unchanged.
---
## Links
- SPEC R1, R2, R7, R8
+65 -11
View File
@@ -372,24 +372,41 @@ When the receiver frees a slot, the sender must learn about it
travel through general vc_comm fabric — it uses a **separate fast
path**, an abstraction of the NVLink / UCIe credit-return wire.
**Latency** is computed from the **bottleneck BW on the path**, not a
magic constant:
**Latency** is computed from the **full path latency** (per-node
overhead + edge propagation + drain), not a magic constant:
```
credit_size_bytes = 16 (ccl.yaml: ipcq_credit_size_bytes)
path = router.find_path(self_pe, peer_pe)
latency = compute_drain_ns(path, credit_size_bytes)
= credit_size_bytes / bottleneck_bw_on_path
path = router.find_path(self_pe, peer_pe.pe_dma)
latency = compute_path_latency_ns(path, credit_size_bytes)
= sum(edge.distance_mm * ns_per_mm)
+ sum(node_overhead_ns[n] for n in path)
+ credit_size_bytes / bottleneck_bw_on_path
```
The router auto-appends `.pe_dma` to the source only, so the
destination MUST be spelled with the explicit `.pe_dma` suffix or
`find_path` raises and the credit silently teleports at zero cost
(latent bug fixed alongside this update).
`tl.recv` blocks on the credit-emit completion (recv yields-from
`_delayed_credit_send` rather than spawning it as a fork). This puts
the credit-return cost on the receiver's `pe_exec_ns`, modeling the
IPCQ control-plane completing the consume-acknowledgement before
recv returns to the kernel — the protocol equivalent of a non-posted
`tl.store` waiting for an HBM ack on the raw DMA path.
That gives us:
- **Topology-proportional approximation**: an in-cube credit return is
automatically faster than a cross-SIP credit return.
- **No magic constants**: no arbitrary `ipcq_ctrl_latency_ns`.
- **No magic constants**: every nanosecond comes from
`compute_path_latency_ns` on the same edge_map and `node_overhead_ns`
as data traffic.
- **No deadlock risk**: unlike piggyback, B can issue credit even when
it has no data to send back.
- **Reuses existing utility**: `ComponentContext.compute_drain_ns`.
it has no data to send back. `peer_credit_store.put` is unbounded.
- **`IPCQ ≥ raw DMA`** for matched physical moves — the credit-emit
cost on recv balances the HBM ack-trip cost RAW pays on the sender.
#### Component coupling — SimPy Store channel
@@ -420,11 +437,21 @@ fan-out (see `IpcqInitMsg` in D12).
#### PE_DMA's added responsibility
When `vc_comm` receives a token, PE_DMA processes it as the following
**atomic** sequence. **No SimPy yield is allowed between the two steps**
(invariant I6):
sequence: pay the Transaction's terminal BW drain, then atomically
write data and forward metadata. **No SimPy yield is allowed between
the data write and the metadata forward** (invariant I6). The drain
yield must sit before the atomic block, not inside it:
```python
def _on_vc_comm_recv(self, env, token):
def _on_vc_comm_recv(self, env, txn):
# Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
# sender PE_DMA). MUST happen before the atomic block so recv only
# wakes after the bytes have "landed".
drain = getattr(txn, "drain_ns", 0.0)
if drain > 0:
yield env.timeout(drain)
token = txn.request
# ── ATOMIC: no yield between these two operations ──
data = self._memory_store.read(token.src_space, token.src_addr,
shape=..., dtype=...)
@@ -439,6 +466,33 @@ The final `put` is yieldable but uses an unbounded internal store, so
it completes in a single step. That `put` is the closing call of the
atomic block; nothing may be inserted before it.
#### Drain-at-inbound semantics (D9 timing model)
The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path`
stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns`
is paid at each forwarding component via `run()`, and the remaining
BW drain is paid once at the Transaction's terminal. Every non-IPCQ
Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via
`ComponentBase._forward_txn` at the terminal node. For IPCQ the
destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound`
(so IPCQ-specific data write + metadata forward can happen), so **the
drain MUST be paid explicitly at the top of that handler** to keep
IPCQ's timing model on par with every other fabric Transaction.
Side-effects of paying drain here:
- **SRC `tl.send`** is unchanged — fire-and-forget semantics are
preserved because the sender PE_DMA does not `yield sub_done`. The
`sub_done.succeed()` call (made after metadata forward below) is an
event with no listener on the sender side.
- **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only
when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata
forward now happens after the drain, recv observes the full fabric
transfer time including bandwidth cost.
Matches the physical picture: send dispatches and leaves; recv waits
until the bytes have actually been drained into its inbox.
### D9.5. ADR-0020 (2-pass) integration
`tl.send` / `tl.recv` integrates with ADR-0020's two-pass model. Phase
+66 -13
View File
@@ -365,23 +365,39 @@ data 경로의 piggyback 모델과 달리, credit return은 일반 vc_comm fabri
거치지 않고 **별도 fast path**로 처리한다. 이는 실제 HW의 NVLink/UCIe
credit return fast path를 추상화한 것이다.
**Latency 계산**: magic constant가 아니라 **라우팅 경로의 bottleneck BW**
기준으로 산출한다.
**Latency 계산**: magic constant가 아니라 **라우팅 경로의 full path
latency** (per-node overhead + edge propagation + drain) 기준으로
산출한다.
```
credit_size_bytes = 16 (ccl.yaml: ipcq_credit_size_bytes)
path = router.find_path(self_pe, peer_pe)
latency = compute_drain_ns(path, credit_size_bytes)
= credit_size_bytes / bottleneck_bw_on_path
path = router.find_path(self_pe, peer_pe.pe_dma)
latency = compute_path_latency_ns(path, credit_size_bytes)
= sum(edge.distance_mm * ns_per_mm)
+ sum(node_overhead_ns[n] for n in path)
+ credit_size_bytes / bottleneck_bw_on_path
```
router는 source에만 `.pe_dma`를 자동 부여하므로 destination에는 반드시
`.pe_dma` suffix를 명시해야 한다. 그렇지 않으면 `find_path`가 raise하고
credit이 0 cost로 silently teleport되는 latent bug가 발생한다 (이번
업데이트에서 수정됨).
`tl.recv`는 credit-emit 완료를 yield-from으로 기다린다 (이전에는
`env.process`로 fork). 이로써 credit-return cost가 receiver의
`pe_exec_ns`에 반영되어, IPCQ control-plane이 consume-acknowledgement를
완료한 뒤에야 recv가 kernel에 반환된다 — RAW DMA의 non-posted `tl.store`
HBM ack-trip을 기다리는 것의 protocol-level 등가물이다.
이로써:
- **토폴로지 비례 approximation**: cube 내 credit return과 cross-SIP credit이
자동으로 다른 latency를 가짐 (정확한 값은 아니지만 magic constant보다 의미 있음)
- **Magic constant 없음**: 별도 `ipcq_ctrl_latency_ns` 같은 임의 값 불필요
- **Deadlock 위험 없음**: piggyback과 달리 B가 A에게 보낼 데이터가 없어도
credit이 자동 발행됨
- **기존 utility 재사용**: `ComponentContext.compute_drain_ns` 그대로 사용
자동으로 다른 latency를 가짐
- **Magic constant 없음**: 모든 ns 값이 데이터 트래픽과 동일한 edge_map
`node_overhead_ns`에서 산출되는 `compute_path_latency_ns`로부터 옴
- **Deadlock 위험 없음**: `peer_credit_store.put`은 unbounded, B가 A에게
보낼 데이터가 없어도 credit이 자동 발행됨
- **`IPCQ ≥ raw DMA`** 보장: matched physical move에 대해 credit-emit이
RAW의 ack-trip cost와 균형을 이룸
```
PE B: tl.recv(W) → 데이터 가져감 → my_tail++
@@ -426,11 +442,22 @@ backend init에서 IpcqInitMsg fan-out 시 양방향 fast path channel을 함께
#### PE_DMA의 책임 추가
PE_DMA(vc_comm)는 token 수신 시 다음 atomic 시퀀스로 처리한다.
**두 동작 사이에 SimPy yield를 두어서는 안 된다** (I6 MUST 규칙 참조):
PE_DMA(vc_comm)는 token 수신 시 다음 시퀀스로 처리한다: Transaction
terminal의 BW drain을 먼저 지불하고, 이어서 atomic하게 data write +
metadata forward 수행. **data write와 metadata forward 사이에는 SimPy
yield를 두어서는 안 된다** (I6 MUST 규칙 참조). drain yield는 atomic
구간 안이 아니라 그 앞에 위치해야 한다:
```python
def _on_vc_comm_recv(self, env, token):
def _on_vc_comm_recv(self, env, txn):
# Sender PE_DMA가 찍어 둔 drain_ns (= nbytes / bottleneck_bw) 를
# 여기서 지불. atomic 구간보다 앞이어야 한다 — recv는 bytes가
# "도착"한 이후에만 깨어나야 하므로.
drain = getattr(txn, "drain_ns", 0.0)
if drain > 0:
yield env.timeout(drain)
token = txn.request
# ── ATOMIC: 두 동작 사이에 yield 금지 ──
# 1. data를 dst_addr에 write (dst의 메모리 공간은 token.dst_endpoint.buffer_kind)
data = self._memory_store.read(token.src_space, token.src_addr,
@@ -446,6 +473,32 @@ wire로 capacity가 unbounded인 store를 사용하므로 즉시 완료된다 (
single-step). 이 최종 put이 atomic 구간의 끝이며, 그 이전에 다른 yield가
삽입되면 안 된다.
#### Drain-at-inbound semantics (D9 timing model)
Transaction은 sender PE_DMA가 `drain_ns = nbytes / bottleneck_bw_on_path`
를 찍어 둔 상태로 fabric에 들어간다. 이 simulator에서 per-hop `overhead_ns`
는 각 forwarding component의 `run()` 에서 지불되고, 남은 BW drain은
Transaction의 terminal node에서 한 번 지불된다. IPCQ가 아닌 모든
Transaction (raw DMA, kernel-launch fanout 등) 은
`ComponentBase._forward_txn` 이 terminal에서 이 drain을 지불한다. IPCQ의
경우 목적지 PE_DMA가 `_handle_ipcq_inbound` 핸들러로 Transaction을
가로채서 (IPCQ 전용 data write + metadata forward를 해야 하므로)
**이 핸들러 최상단에서 drain을 명시적으로 지불해야 한다** — 그래야 IPCQ의
timing model이 다른 모든 fabric Transaction과 동일선상에 놓인다.
여기서 drain을 지불할 때의 side-effect:
- **SRC `tl.send`**: 동작 불변. sender PE_DMA가 `sub_done``yield`
하지 않으므로 fire-and-forget 의미가 보존된다. metadata forward 이후
호출되는 `sub_done.succeed()` 는 sender 입장에서 listener가 없는 이벤트.
- **DST `tl.recv`**: `drain_ns` 만큼 늦게 깨어난다. recv는 local PE_IPCQ
`IpcqMetaArrival` 수신 시에만 wake되며, metadata forward가 drain
이후로 이동했으므로 recv는 bandwidth까지 포함한 전체 fabric transfer
시간을 관측하게 된다.
물리적 그림과 일치: send는 dispatch하고 바로 반환; recv는 bytes가 실제로
자신의 inbox로 drain될 때까지 대기.
#### Backpressure latency 정확도
backpressure 해제까지 걸리는 시간:
+8 -1
View File
@@ -2,7 +2,14 @@
## Status
Proposed (Revision 8 — Hierarchical content split out to ADR-0029)
Accepted. rank = SIP process-group model stands. The allreduce algorithm
path (mapper / validator / per-PE install machinery originally targeted at
ADR-0029) has been replaced by ADR-0032: `AhbmCCLBackend` now calls
`configure_sfr_intercube_multisip` at `init_process_group` time and the
intercube kernel receives `(sip_rank, sip_topo_kind, sip_topo_w,
sip_topo_h)` appended after the module's `kernel_args()`. The
`leader_only` / `all_pes` mapper concepts in this document are no longer
used by the default allreduce path.
## Context
@@ -89,7 +89,14 @@ direction_idx × bytes_per_direction). 따라서:
`src/kernbench/ccl/install.py`:
```python
_OPPOSITE_DIR = {"E": "W", "W": "E", "N": "S", "S": "N"}
# Extended in ADR-0032 with global_* pairs for inter-SIP directions,
# which were introduced by configure_sfr_intercube_multisip to keep
# intercube (N/S/E/W) and inter-SIP (global_N/S/E/W) namespaces disjoint.
_OPPOSITE_DIR = {
"E": "W", "W": "E", "N": "S", "S": "N",
"global_E": "global_W", "global_W": "global_E",
"global_N": "global_S", "global_S": "global_N",
}
def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None:
"""Find peer's direction that reciprocates my_dir→peer_rank.
+3 -1
View File
@@ -2,7 +2,9 @@
## Status
Proposed
Superseded by ADR-0032 (Intercube all-reduce). The 3-level kernel and
`hierarchical_allreduce.py` module have been removed. The cube-mesh
intercube + inter-SIP path is now the single all-reduce algorithm.
## Context
@@ -2,7 +2,11 @@
## Status
Stub (Blocker for ADR-0030 — specific range allocations TBD)
Superseded by ADR-0001 (Revision 2, 2026-04-27).
PE_LOCAL / MCPU_LOCAL / CUBE_SRAM sub-unit tables are now defined in
ADR-0001 D2.3.3-D2.3.5.
Previous status: Stub (Blocker for ADR-0030 — specific range allocations TBD)
## Context
+256
View File
@@ -0,0 +1,256 @@
# ADR-0032: Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
## Status
Accepted (supersedes ADR-0029).
## Context
### Goal
Define a single all-reduce algorithm that exploits the topology hierarchy:
cube mesh within each SIP (intercube) + inter-SIP exchange. One kernel,
one SFR configuration path, driven by `topology.yaml` and `ccl.yaml`.
### Why replace ADR-0029 (hierarchical 3-level)
ADR-0029 proposed a 3-level (intra-cube → inter-cube → inter-SIP) algorithm
where every PE in the system participates. In practice this adds the
intra-cube PE-to-PE stage complexity (bidirectional reduce + chain broadcast)
without matching the common workload pattern where the tensor is sharded
**per cube** (not per PE within a cube).
Moreover, the hierarchical design required:
- per-PE neighbor graph installation (`_build_pe_installs` multi-level)
- multi-level topology schema (`hierarchical_3level`)
- `all_pes` mapper + `multi_pe_sip_local` validator infrastructure
The intercube algorithm below removes all of that: **pe0-only same-lane
intercube reduce on the 4×4 cube mesh**, then inter-SIP exchange on the
root cube, then broadcast back. Simpler kernel, simpler wiring, same
bandwidth characteristics for the common per-cube DP workload.
### Current state
- `src/kernbench/ccl/algorithms/intercube_allreduce.py` — kernel
- `src/kernbench/ccl/sfr_config.py``configure_sfr_intercube_multisip`
- `src/kernbench/runtime_api/distributed.py``AhbmCCLBackend` wires this
automatically at `init_process_group` time.
- Old `ring_allreduce`, `mesh_allreduce`, `tree_allreduce`,
`hierarchical_allreduce` modules and their tests are **removed**.
---
## Decision
### D1. Algorithm structure — 5 phases
For each SIP (launched concurrently by `mp.spawn`):
```
Phase 1 — Row reduce W → E (cube mesh, pe0 only):
col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds row sum.
Phase 2 — Col reduce N → S on rightmost column (pe0, col = mesh_w-1):
row=0 sends S → row=1 accumulates, sends S → ... → root cube (15)
holds the full SIP sum.
Phase 3 — Inter-SIP exchange on root cube (pe0 of root cube only):
Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast —
selected by sip_topo_kind (from topology.yaml sips.topology).
Phase 4 — Col broadcast S → N on rightmost column.
Phase 5 — Row broadcast E → W across the cube mesh.
```
After all phases every cube's pe0 holds the global sum.
The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}`
(ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
across topologies; only phase 3 branches. Helper functions
`_inter_sip_ring`, `_inter_sip_torus_2d`, `_inter_sip_mesh_2d` encode the
three exchange patterns.
### D2. Tensor layout (rank = SIP, per-worker)
Per ADR-0024 rank = SIP at the process-group level. Each worker allocates
its own cube-mesh-spanning tensor:
```python
dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=16, num_pes=1)
tensor = torch.zeros((n_cubes, n_elem), dtype="f16", dp=dp)
```
Shard layout: 16 shards per SIP, one per cube on pe0. The kernel addresses
each cube's shard as `pe_addr = t_ptr + cube_id * n_elem * 2`.
### D3. SFR / IPCQ wiring — `configure_sfr_intercube_multisip`
Replaces the rank-to-2-PE install from ADR-0024. Wires PE_IPCQ neighbor
tables for **every cube's pe0 across every SIP** — regardless of which
cube is the root or which SIP topology is selected. This lets the kernel
elect the root cube at runtime and supports topology switches without
re-wiring.
| Level | Direction labels | Scope |
|---|---|---|
| Intercube within SIP | N / S / E / W | pe0 of every cube → pe0 of mesh neighbors (no wrap) |
| Inter-SIP (all cubes) | global_E / global_W / global_N / global_S | pe0 of cube c on sip A → pe0 of cube c on peer SIP per `sips.topology` |
Inter-SIP directions use the `global_*` prefix to keep the namespace
disjoint from intercube directions. ADR-0025's `_OPPOSITE_DIR` is extended
with `global_E ↔ global_W` and `global_N ↔ global_S` so the reverse-
direction resolver handles 2-SIP bidirectional rings correctly.
Internally the function calls `install_ipcq` with:
- `world_size = n_sips × n_cubes`
- `rank_to_pe = [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]`
- A closure-captured `neighbors()` function that builds the map above.
This `world_size` is internal to IPCQ wiring and does not leak to the
process-group rank.
### D4. SIP topology — from `topology.yaml`
```yaml
system:
sips:
count: 2
topology: ring_1d # or torus_2d, mesh_2d_no_wrap
```
- `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`.
- `torus_2d`: sqrt(n_sips)×sqrt(n_sips) wrapping mesh. Row ring on
`global_E/W` then col ring on `global_S/N`.
- `mesh_2d_no_wrap`: square mesh without wrap-around. Chain reduce +
broadcast per dimension.
2D variants require `n_sips` to be a perfect square.
### D5. Process-group integration — `AhbmCCLBackend`
At `init_process_group` time the backend:
1. Loads `ccl.yaml` + `topology.yaml`.
2. Derives `sip_topo_kind, sip_topo_w, sip_topo_h` from
`system.sips.topology` using the algorithm module's `TOPO_NAME_TO_KIND`.
3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)` — one-time
SFR wiring, mirrors NCCL communicator creation.
At each `dist.all_reduce(tensor)` call:
1. Resolves `kernel_fn` from `cfg["module"]`.
2. Builds args: `(n_elem, cube_w, cube_h, n_sips)` from
`kernel_args(world_size, n_elem)`.
3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where
`sip_rank` is the current greenlet's bound rank.
4. Launches with `_defer_wait=True`; the main scheduler drains pending
handles after all workers submit (per ADR-0024 D7 / ADR-0027 D0.4).
### D6. Config schema
`ccl.yaml`:
```yaml
defaults:
algorithm: intercube_allreduce
buffer_kind: tcm
...
algorithms:
intercube_allreduce:
module: kernbench.ccl.algorithms.intercube_allreduce
topology: none
buffer_kind: tcm
n_elem: 8
root_cube: 15
```
`topology.yaml`:
```yaml
system:
sips:
count: 2
topology: ring_1d
sip:
cube_mesh: { w: 4, h: 4 }
```
### D7. Algorithm module contract
Modules loaded via `cfg["module"]` must export:
| Name | Purpose |
|---|---|
| `kernel` | callable, signature `(t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl)` |
| `kernel_args(world_size, n_elem) -> tuple` | returns the first 4 scalar args (per-tensor) |
| `TOPO_NAME_TO_KIND: dict[str, int]` | maps `system.sips.topology` name to kernel branch code |
| `SIP_TOPO_RING`, `SIP_TOPO_TORUS`, `SIP_TOPO_MESH` | integer constants (0, 1, 2) |
---
## Dependencies
- **ADR-0023**: IPCQ protocol (neighbor table, send/recv, credit return).
- **ADR-0024**: rank = SIP launcher, `mp.spawn`, greenlet-local rank.
- **ADR-0025**: Address-based IPCQ direction matching; extended
`_OPPOSITE_DIR` with `global_*` pairs.
- **ADR-0027**: Worker-wait / collective-pending drain in main scheduler.
## Non-goals
- **Per-PE allreduce** (intra-cube PE-to-PE reduce). Out of scope — the
workload for this algorithm is per-cube DP.
- **Asymmetric SIP topologies** (non-square mesh/torus). `torus_2d` and
`mesh_2d_no_wrap` require `n_sips = k²`.
- **Pipelined chunks**: single-tile per cube, no pipelining yet.
- **Root cube runtime election**: the kernel currently uses
`root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)` hardcoded to the SE
corner. SFR wiring covers all cubes, so runtime election is a pure kernel
change when needed.
---
## Consequences
### Positive
- **Single kernel, single install path** for all-reduce — replaces four
removed modules (`ring`, `mesh`, `tree`, `hierarchical`).
- **Topology-agnostic kernel**: ring / torus / mesh selected via one
integer param, no kernel duplication.
- **Automatic via `dist.all_reduce`**: no bench-level or user-level
algorithm selection needed; config-driven end-to-end.
- **Full SFR wiring**: every cube on every SIP has inter-SIP links
available — supports future dynamic root-cube election.
### Negative
- **Not suitable for per-PE sharded tensors**: TP-layer-style tensors that
shard within one cube across 8 PEs are not addressable by this kernel.
Such workloads would need a separate intra-cube all-reduce path (not
yet implemented).
- **`configure_sfr_intercube_multisip` always wires all pe0s**: even if a
given run only needs a subset (e.g. 1 SIP, ring only). Install cost is
small but not zero.
---
## Affected files
| File | Change |
|---|---|
| `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
| `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
| `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
| `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
| `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
| `ccl.yaml` | Single `intercube_allreduce` entry |
| `topology.yaml` | Added `system.sips.topology` |
| `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
| `tests/test_allreduce_multidevice.py` (new) | Config-driven ring/torus/mesh |
| `tests/test_distributed_intercube_allreduce.py` (new) | Full `dist.all_reduce` path |
| `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
| Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |
+1 -1
View File
@@ -6,7 +6,7 @@ build-backend = "setuptools.build_meta"
name = "kernbench"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = ["pytest", "simpy", "pyyaml", "fastapi>=0.110", "uvicorn[standard]>=0.29", "websockets>=12", "numpy>=1.24", "greenlet>=3.0"]
dependencies = ["pytest", "simpy", "pyyaml", "fastapi>=0.110", "uvicorn[standard]>=0.29", "websockets>=12", "numpy>=1.24", "greenlet>=3.0", "matplotlib>=3.7"]
[project.scripts]
kernbench = "kernbench.cli.main:main"
@@ -24,9 +24,7 @@ TOPO_NAME_TO_KIND = {
}
def kernel_args(world_size: int, n_elem: int) -> tuple:
cube_w = 4
cube_h = 4
def kernel_args(world_size: int, n_elem: int, *, cube_w: int = 4, cube_h: int = 4) -> tuple:
return (n_elem, cube_w, cube_h, world_size)
@@ -127,61 +125,79 @@ def allreduce_intercube_multidevice(
row = cube_id // cube_w
col = cube_id % cube_w
nbytes = n_elem * 2
single_cube = (cube_w == 1 and cube_h == 1)
pe_addr = t_ptr + cube_id * nbytes
acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
# ── Phase 1: row reduce W → E ──
if col == 0:
tl.send(dir="E", src=acc)
elif col < cube_w - 1:
recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
acc = acc + recv
tl.send(dir="E", src=acc)
if single_cube:
# ── Single-cube mode: skip intra-SIP reduce, go directly to
# inter-SIP exchange (TP use case: one cube per rank). ──
if n_sips > 1:
if sip_topo_kind == SIP_TOPO_RING:
acc = _inter_sip_ring(acc, n_sips, n_elem, tl)
elif sip_topo_kind == SIP_TOPO_TORUS:
acc = _inter_sip_torus_2d(
acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
elif sip_topo_kind == SIP_TOPO_MESH:
acc = _inter_sip_mesh_2d(
acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
else:
recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
acc = acc + recv
# ── Multi-cube mode: full mesh reduce + inter-SIP + broadcast ──
# ── Phase 2: col reduce NS on rightmost column ──
if col == cube_w - 1:
if row == 0:
tl.send(dir="S", src=acc)
elif row < cube_h - 1:
recv = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
# Phase 1: row reduce WE
if col == 0:
tl.send(dir="E", src=acc)
elif col < cube_w - 1:
recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
acc = acc + recv
tl.send(dir="S", src=acc)
tl.send(dir="E", src=acc)
else:
recv = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
acc = acc + recv
# ── Phase 3: inter-SIP exchange on root cube ──
root_cube = (cube_h - 1) * cube_w + (cube_w - 1)
if cube_id == root_cube and n_sips > 1:
if sip_topo_kind == SIP_TOPO_RING:
acc = _inter_sip_ring(acc, n_sips, n_elem, tl)
elif sip_topo_kind == SIP_TOPO_TORUS:
acc = _inter_sip_torus_2d(acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
elif sip_topo_kind == SIP_TOPO_MESH:
acc = _inter_sip_mesh_2d(acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
# Phase 2: col reduce N → S on rightmost column
if col == cube_w - 1:
if row == 0:
tl.send(dir="S", src=acc)
elif row < cube_h - 1:
recv = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
acc = acc + recv
tl.send(dir="S", src=acc)
else:
recv = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
acc = acc + recv
# ── Phase 4: col broadcast S → N on rightmost column ──
if col == cube_w - 1:
if row == cube_h - 1:
tl.send(dir="N", src=acc)
elif row > 0:
acc = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
tl.send(dir="N", src=acc)
# Phase 3: inter-SIP exchange on root cube
root_cube = (cube_h - 1) * cube_w + (cube_w - 1)
if cube_id == root_cube and n_sips > 1:
if sip_topo_kind == SIP_TOPO_RING:
acc = _inter_sip_ring(acc, n_sips, n_elem, tl)
elif sip_topo_kind == SIP_TOPO_TORUS:
acc = _inter_sip_torus_2d(
acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
elif sip_topo_kind == SIP_TOPO_MESH:
acc = _inter_sip_mesh_2d(
acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
# Phase 4: col broadcast S → N on rightmost column
if col == cube_w - 1:
if row == cube_h - 1:
tl.send(dir="N", src=acc)
elif row > 0:
acc = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
tl.send(dir="N", src=acc)
else:
acc = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
# Phase 5: row broadcast E → W
if col == cube_w - 1:
tl.send(dir="W", src=acc)
elif col > 0:
acc = tl.recv(dir="E", shape=(n_elem,), dtype="f16")
tl.send(dir="W", src=acc)
else:
acc = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
# ── Phase 5: row broadcast E → W ──
if col == cube_w - 1:
tl.send(dir="W", src=acc)
elif col > 0:
acc = tl.recv(dir="E", shape=(n_elem,), dtype="f16")
tl.send(dir="W", src=acc)
else:
acc = tl.recv(dir="E", shape=(n_elem,), dtype="f16")
acc = tl.recv(dir="E", shape=(n_elem,), dtype="f16")
tl.store(pe_addr, acc)
+2
View File
@@ -221,6 +221,8 @@ def install_ipcq(
_OPPOSITE_DIR = {
"E": "W", "W": "E", "N": "S", "S": "N",
"intra_E": "intra_W", "intra_W": "intra_E",
"intra_N": "intra_S", "intra_S": "intra_N",
"global_E": "global_W", "global_W": "global_E",
"global_N": "global_S", "global_S": "global_N",
}
+95 -36
View File
@@ -1,22 +1,24 @@
"""SFR configuration for intercube + inter-SIP IPCQ wiring.
"""SFR configuration for the full IPCQ hardware wiring.
Provides ``configure_sfr_intercube_multisip`` which programs PE_IPCQ
neighbor tables for:
Installs PE_IPCQ neighbor tables modeling the physical hardware.
Wiring is independent of DPPolicy / kernel choice — the kernel decides
at runtime which links to use.
1. Intercube within each SIP — pe0 of every cube connects to pe0 of
its N/S/E/W mesh neighbors (no wrap-around).
2. Inter-SIP on ALL cubes — pe0 of cube_c on sip_A connects to pe0 of
cube_c on each peer SIP, using ``global_E``/``global_W`` (ring) or
``global_N``/``global_S``/``global_E``/``global_W`` (mesh/torus)
direction labels. Wiring all cubes allows the kernel to
dynamically elect the root cube at runtime.
Direction label namespaces (disjoint):
SIP-level topology is read from ``topology.yaml`` →
``system.sips.topology`` (e.g. ``ring_1d``, ``mesh_2d``).
Intercube mesh dimensions come from ``sip.cube_mesh.w/h``.
- Intra-cube PE-to-PE: ``intra_N / intra_S / intra_E / intra_W``
Logical 2×4 PE grid within a cube (no wrap):
Internally delegates to ``install_ipcq`` with a computed ``rank_to_pe``
(pe0-only) and a closure-captured ``neighbors()`` function.
Row 0: pe0 pe1 pe2 pe3
Row 1: pe4 pe5 pe6 pe7
- Intercube same-lane: ``N / S / E / W``
``pe_i of cube_A ↔ pe_i of cube_B`` across the 4×4 cube mesh
(no wrap). Every PE i ∈ [0..7] wired independently.
- Inter-SIP same-(cube, pe): ``global_N / global_S / global_E / global_W``
``pe_i of cube_c on sip_A ↔ pe_i of cube_c on sip_B`` per
``topology.yaml → system.sips.topology``.
"""
from __future__ import annotations
@@ -27,12 +29,46 @@ from kernbench.ccl.install import install_ipcq
from kernbench.ccl.topologies import _BUILTIN as _TOPO_BUILTINS
# ── Intra-cube 2×4 PE grid ───────────────────────────────────────────
_PE_GRID_COLS = 4
_PE_GRID_ROWS = 2
_PES_PER_CUBE = _PE_GRID_COLS * _PE_GRID_ROWS # 8
def _intra_cube_neighbors(pe: int) -> dict[str, int]:
"""Logical 2×4 PE grid neighbors within a cube (no wrap).
Returns directions in the ``intra_*`` namespace.
"""
row, col = divmod(pe, _PE_GRID_COLS)
nbrs: dict[str, int] = {}
if col < _PE_GRID_COLS - 1:
nbrs["intra_E"] = row * _PE_GRID_COLS + (col + 1)
if col > 0:
nbrs["intra_W"] = row * _PE_GRID_COLS + (col - 1)
if row < _PE_GRID_ROWS - 1:
nbrs["intra_S"] = (row + 1) * _PE_GRID_COLS + col
if row > 0:
nbrs["intra_N"] = (row - 1) * _PE_GRID_COLS + col
return nbrs
# ── Public entry point ───────────────────────────────────────────────
def configure_sfr_intercube_multisip(
engine: Any,
spec: dict,
cfg: dict,
) -> dict[str, Any]:
"""Wire IPCQ for intercube (pe0, mesh) + inter-SIP (pe0, all cubes).
"""Wire the full IPCQ hardware model.
Every PE on every cube on every SIP gets neighbor table entries for:
- intra-cube (2×4 grid) in the ``intra_*`` namespace
- intercube same-lane (4×4 cube mesh, no wrap) in ``N/S/E/W``
- inter-SIP same-(cube, pe) in ``global_*``
Args:
engine: GraphEngine with ``_components``.
@@ -46,48 +82,71 @@ def configure_sfr_intercube_multisip(
mesh_w = int(cm["w"])
mesh_h = int(cm["h"])
n_cubes = mesh_w * mesh_h
n_sips = int(spec.get("system", {}).get("sips", {}).get("count", 1))
sip_topology = str(
spec.get("system", {}).get("sips", {}).get("topology", "ring_1d")
)
sips_cfg = spec.get("system", {}).get("sips", {})
n_sips = int(sips_cfg.get("count", 1))
sip_topology = str(sips_cfg.get("topology", "ring_1d"))
sip_w = sips_cfg.get("w")
sip_h = sips_cfg.get("h")
sip_w = int(sip_w) if sip_w is not None else None
sip_h = int(sip_h) if sip_h is not None else None
if sip_topology not in _TOPO_BUILTINS:
raise ValueError(
f"Unknown sip topology '{sip_topology}'. "
f"Available: {list(_TOPO_BUILTINS)}"
)
sip_topo_fn = _TOPO_BUILTINS[sip_topology]
_sip_topo_fn_raw = _TOPO_BUILTINS[sip_topology]
world_size = n_sips * n_cubes
def sip_topo_fn(rank: int, ws: int) -> dict:
if sip_w is not None and sip_h is not None:
try:
return _sip_topo_fn_raw(rank, ws, w=sip_w, h=sip_h)
except TypeError:
pass
return _sip_topo_fn_raw(rank, ws)
pes_per_cube = _PES_PER_CUBE
world_size = n_sips * n_cubes * pes_per_cube
pe_idx_to_pe: list[tuple[int, int, int]] = [
(sip, cube, 0)
(sip, cube, pe)
for sip in range(n_sips)
for cube in range(n_cubes)
for pe in range(pes_per_cube)
]
def _pe_idx(sip: int, cube: int, pe: int) -> int:
return (sip * n_cubes + cube) * pes_per_cube + pe
def _neighbors(pe_idx: int, ws: int, _base: dict) -> dict[str, int]:
sip = pe_idx // n_cubes
cube = pe_idx % n_cubes
tmp = pe_idx
pe = tmp % pes_per_cube
tmp //= pes_per_cube
cube = tmp % n_cubes
sip = tmp // n_cubes
row = cube // mesh_w
col = cube % mesh_w
nbrs: dict[str, int] = {}
# Intercube within SIP (mesh, no wrap-around)
if col < mesh_w - 1:
nbrs["E"] = sip * n_cubes + (row * mesh_w + col + 1)
if col > 0:
nbrs["W"] = sip * n_cubes + (row * mesh_w + col - 1)
if row < mesh_h - 1:
nbrs["S"] = sip * n_cubes + ((row + 1) * mesh_w + col)
if row > 0:
nbrs["N"] = sip * n_cubes + ((row - 1) * mesh_w + col)
# ── Intra-cube (intra_N/S/E/W) ──
for d, peer_pe in _intra_cube_neighbors(pe).items():
nbrs[d] = _pe_idx(sip, cube, peer_pe)
# Inter-SIP on ALL cubes
# ── Intercube same-lane (N/S/E/W, 4×4 no wrap) ──
if col < mesh_w - 1:
nbrs["E"] = _pe_idx(sip, row * mesh_w + (col + 1), pe)
if col > 0:
nbrs["W"] = _pe_idx(sip, row * mesh_w + (col - 1), pe)
if row < mesh_h - 1:
nbrs["S"] = _pe_idx(sip, (row + 1) * mesh_w + col, pe)
if row > 0:
nbrs["N"] = _pe_idx(sip, (row - 1) * mesh_w + col, pe)
# ── Inter-SIP same-(cube, pe) (global_*) ──
if n_sips > 1:
sip_nbrs = sip_topo_fn(sip, n_sips)
for d, peer_sip in sip_nbrs.items():
nbrs[f"global_{d}"] = peer_sip * n_cubes + cube
nbrs[f"global_{d}"] = _pe_idx(peer_sip, cube, pe)
return nbrs
+49 -37
View File
@@ -33,23 +33,41 @@ def ring_1d_unidir(rank: int, world_size: int) -> NeighborMap:
return {"E": (rank + 1) % world_size}
def mesh_2d(rank: int, world_size: int) -> NeighborMap:
"""Square 2D mesh (N/S/E/W).
Layout: rank = row * side + col, with side = sqrt(world_size).
Wrap-around (torus) on all four edges.
"""
def _resolve_2d_dims(
world_size: int, w: int | None, h: int | None, name: str,
) -> tuple[int, int]:
if w is not None and h is not None:
if w * h != world_size:
raise ValueError(
f"{name}: w*h ({w}*{h}) != world_size ({world_size})"
)
return w, h
side = int(round(world_size ** 0.5))
if side * side != world_size:
raise ValueError(
f"mesh_2d requires square world_size, got {world_size}"
f"{name} requires square world_size or explicit w,h, "
f"got {world_size}"
)
r, c = divmod(rank, side)
return side, side
def mesh_2d(
rank: int, world_size: int,
w: int | None = None, h: int | None = None,
) -> NeighborMap:
"""2D mesh (N/S/E/W) with wrap-around on all four edges.
Layout: rank = row * w + col. When w, h are given, supports
rectangular (e.g. 2x3) layouts. Otherwise falls back to square
side = sqrt(world_size).
"""
w, h = _resolve_2d_dims(world_size, w, h, "mesh_2d")
r, c = divmod(rank, w)
return {
"N": ((r - 1) % side) * side + c,
"S": ((r + 1) % side) * side + c,
"W": r * side + (c - 1) % side,
"E": r * side + (c + 1) % side,
"N": ((r - 1) % h) * w + c,
"S": ((r + 1) % h) * w + c,
"W": r * w + (c - 1) % w,
"E": r * w + (c + 1) % w,
}
@@ -73,36 +91,30 @@ def tree_binary(rank: int, world_size: int) -> NeighborMap:
return n
def torus_2d(rank: int, world_size: int) -> NeighborMap:
"""Square 2D torus (N/S/E/W) with wrap-around on all edges.
Alias for mesh_2d (which already wraps). Explicit name for clarity
when used as a SIP-level topology.
"""
return mesh_2d(rank, world_size)
def torus_2d(
rank: int, world_size: int,
w: int | None = None, h: int | None = None,
) -> NeighborMap:
"""2D torus (N/S/E/W) with wrap-around on all edges. Alias for mesh_2d."""
return mesh_2d(rank, world_size, w=w, h=h)
def mesh_2d_no_wrap(rank: int, world_size: int) -> NeighborMap:
"""Square 2D mesh (N/S/E/W) WITHOUT wrap-around.
Edge nodes have fewer neighbors (no wrapping). Used for SIP-level
topologies where physical links don't wrap.
"""
side = int(round(world_size ** 0.5))
if side * side != world_size:
raise ValueError(
f"mesh_2d_no_wrap requires square world_size, got {world_size}"
)
r, c = divmod(rank, side)
def mesh_2d_no_wrap(
rank: int, world_size: int,
w: int | None = None, h: int | None = None,
) -> NeighborMap:
"""2D mesh (N/S/E/W) WITHOUT wrap-around. Supports rectangular dims."""
w, h = _resolve_2d_dims(world_size, w, h, "mesh_2d_no_wrap")
r, c = divmod(rank, w)
n: NeighborMap = {}
if r > 0:
n["N"] = (r - 1) * side + c
if r < side - 1:
n["S"] = (r + 1) * side + c
n["N"] = (r - 1) * w + c
if r < h - 1:
n["S"] = (r + 1) * w + c
if c > 0:
n["W"] = r * side + (c - 1)
if c < side - 1:
n["E"] = r * side + (c + 1)
n["W"] = r * w + (c - 1)
if c < w - 1:
n["E"] = r * w + (c + 1)
return n
+1 -1
View File
@@ -23,7 +23,7 @@ def _hbm_pa(sip: int, cube: int, pe_id: int, spec: dict) -> int:
mm = spec["cube"]["memory_map"]
slice_bytes = mm["hbm_total_gb_per_cube"] * (1 << 30) // mm["hbm_slices_per_cube"]
pa = PhysAddr.pe_hbm_addr(
rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
sip_id=sip, die_id=cube, pe_id=pe_id,
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
)
return pa.encode()
+80 -5
View File
@@ -58,7 +58,18 @@ class IoCpuComponent(ComponentBase):
self._pending[key] = (expected, received, parent_done)
def _dispatch_to_m_cpus(self, env: simpy.Environment, txn: Any) -> Generator:
"""Fan out sub-Transactions to target cube M_CPUs, wait for responses."""
"""Fan out sub-Transactions to target cube M_CPUs, wait for responses.
ADR-0009 D5 (extended): for KernelLaunchMsg, stamp a single global
target_start_ns = env.now + max(IO_CPU → any target PE_CPU path
latency across all target cubes). M_CPU passes this value through
unchanged; every PE in every cube yields until the same sim-time
before beginning kernel execution. Without this, cross-cube
launches would have each cube's M_CPU compute its own per-cube
barrier relative to its local env.now, leaving PEs on different
cubes out of sync (the "h3/h4 dispatch-offset artifact").
"""
import dataclasses
from kernbench.runtime_api.kernel import KernelLaunchMsg, MemoryReadMsg, MemoryWriteMsg
request = txn.request
@@ -72,10 +83,60 @@ class IoCpuComponent(ComponentBase):
txn.done.succeed()
return
# For KernelLaunchMsg, compute the global barrier once here so
# every downstream PE_CPU uses the same target_start_ns.
if isinstance(request, KernelLaunchMsg):
io_overhead = self.ctx.node_overhead_ns.get(self.node.id, 0.0)
global_max_latency = 0.0
pe_ids = self._resolve_pe_ids(
getattr(request, "target_pe", "all")
)
for sip, cube in cube_targets:
try:
m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
io_to_m_path = self.ctx.router.find_node_path(
self.node.id, m_cpu_id,
)
except Exception:
continue
if len(io_to_m_path) < 2:
continue
leg1 = self.ctx.compute_path_latency_ns(
io_to_m_path, nbytes=0,
)
m_overhead = self.ctx.node_overhead_ns.get(m_cpu_id, 0.0)
for pe_id in pe_ids:
pe_cpu_id = (
f"sip{sip}.cube{cube}.pe{pe_id}.pe_cpu"
)
try:
m_to_pe_path = self.ctx.router.find_node_path(
m_cpu_id, pe_cpu_id,
)
except Exception:
continue
if len(m_to_pe_path) < 2:
continue
leg2 = self.ctx.compute_path_latency_ns(
m_to_pe_path, nbytes=0,
)
latency = leg1 + leg2 - io_overhead - m_overhead
if latency > global_max_latency:
global_max_latency = latency
request = dataclasses.replace(
request,
target_start_ns=float(env.now) + global_max_latency,
)
# Setup aggregation
self._pending[request.request_id] = (len(cube_targets), 0, txn.done)
# Fan out to each target cube's M_CPU
# Fan out to each target cube's M_CPU. Kernel-launch fanout
# carries control metadata only; nbytes is forced to 0 for
# KernelLaunchMsg so the launch sub-txns do not occupy data-fabric
# BW (would otherwise serialize 16 cubes worth of fanout on the
# shared first hop and break ADR-0009 D5's barrier prediction).
is_kernel_launch = isinstance(request, KernelLaunchMsg)
for sip, cube in cube_targets:
try:
m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
@@ -86,11 +147,25 @@ class IoCpuComponent(ComponentBase):
continue
sub_txn = Transaction(
request=request, path=path, step=0,
nbytes=txn.nbytes, done=env.event(),
nbytes=0 if is_kernel_launch else txn.nbytes,
done=env.event(),
result_data=txn.result_data,
)
yield self.out_ports[path[1]].put(sub_txn.advance())
def _resolve_pe_ids(self, target_pe: Any) -> list[int]:
"""Resolve target_pe → list of PE indices (mirrors M_CPU logic)."""
if isinstance(target_pe, int):
return [target_pe]
if isinstance(target_pe, tuple):
return list(target_pe)
# "all": all PEs in a cube
n_slices = 8
if self.ctx and self.ctx.spec:
mm = self.ctx.spec.get("cube", {}).get("memory_map", {})
n_slices = mm.get("hbm_slices_per_cube", 8)
return list(range(n_slices))
def _resolve_cube_targets(self, request: Any) -> list[tuple[int, int]]:
"""Return list of (sip, cube) pairs to fan out to."""
from kernbench.runtime_api.kernel import (
@@ -145,10 +220,10 @@ class IoCpuComponent(ComponentBase):
return []
def _cube_from_pa(self, pa_val: int, fallback: int) -> int:
"""Extract cube_id from a physical address, with fallback."""
"""Extract die_id from a physical address, with fallback."""
from kernbench.policy.address.phyaddr import PhysAddr
try:
return PhysAddr.decode(pa_val).cube_id
return PhysAddr.decode(pa_val).die_id
except Exception:
return fallback
+37 -8
View File
@@ -162,7 +162,11 @@ class MCpuComponent(ComponentBase):
Routes through find_node_path (M_CPU → NOC → PE_CPU command edges).
PE_CPU sends ResponseMsg back via NOC → M_CPU on completion.
Then sends aggregate ResponseMsg back to IO_CPU on the reverse path.
ADR-0009 D5: stamps target_start_ns so every PE in this fanout
starts executing at the same env.now regardless of dispatch path.
"""
import dataclasses
request = txn.request
target_pe = getattr(request, "target_pe", "all")
cube_prefix = self.node.id.rsplit(".", 1)[0] # e.g. "sip0.cube0"
@@ -172,9 +176,13 @@ class MCpuComponent(ComponentBase):
txn.done.succeed()
return
# Fan out to each PE_CPU, using response-based aggregation
sub_txns: list[Transaction] = []
n_dispatched = 0
# Resolve per-PE paths. If IO_CPU already stamped a global
# target_start_ns (ADR-0009 D5 extended), pass it through
# unchanged so every PE across every cube uses the same barrier.
# Otherwise (e.g. direct-to-M_CPU launch in a unit test) compute
# a per-cube barrier from env.now.
per_pe: list[tuple[int, list[str], float]] = []
max_latency = 0.0
for pe_id in pe_ids:
pe_cpu_id = f"{cube_prefix}.pe{pe_id}.pe_cpu"
try:
@@ -183,8 +191,24 @@ class MCpuComponent(ComponentBase):
continue
if len(path) < 2:
continue
latency = self.ctx.compute_path_latency_ns(path, nbytes=0)
per_pe.append((pe_id, path, latency))
if latency > max_latency:
max_latency = latency
if getattr(request, "target_start_ns", None) is not None:
stamped_request = request
else:
stamped_request = dataclasses.replace(
request, target_start_ns=float(env.now) + max_latency,
)
# Fan out to each PE_CPU, using response-based aggregation
sub_txns: list[Transaction] = []
n_dispatched = 0
for pe_id, path, _lat in per_pe:
sub_txn = Transaction(
request=request, path=path, step=0,
request=stamped_request, path=path, step=0,
nbytes=0, done=env.event(),
)
yield self.out_ports[path[1]].put(sub_txn.advance())
@@ -204,16 +228,21 @@ class MCpuComponent(ComponentBase):
yield all_done
del self._parent_txns[request.request_id]
# Aggregate PE-internal metrics (max across PEs)
# Aggregate PE-internal metrics (max across PEs and across cubes).
# Multiple M_CPUs share the same result_data dict via IO_CPU fanout;
# merge against the existing value so cubes don't clobber each other.
pe_exec_values = [st.result_data.get("pe_exec_ns", 0.0) for st in sub_txns]
if pe_exec_values:
txn.result_data["pe_exec_ns"] = max(pe_exec_values)
cur = txn.result_data.get("pe_exec_ns", 0.0) or 0.0
txn.result_data["pe_exec_ns"] = max(cur, max(pe_exec_values))
dma_values = [st.result_data.get("dma_ns", 0.0) for st in sub_txns]
if dma_values:
txn.result_data["dma_ns"] = max(dma_values)
cur = txn.result_data.get("dma_ns", 0.0) or 0.0
txn.result_data["dma_ns"] = max(cur, max(dma_values))
compute_values = [st.result_data.get("compute_ns", 0.0) for st in sub_txns]
if compute_values:
txn.result_data["compute_ns"] = max(compute_values)
cur = txn.result_data.get("compute_ns", 0.0) or 0.0
txn.result_data["compute_ns"] = max(cur, max(compute_values))
# Send aggregate response on reverse command path back to IO_CPU
reverse_path = list(reversed(txn.path))
@@ -95,6 +95,13 @@ class PeCpuComponent(ComponentBase):
request = txn.request
yield from self.run(env, 0)
# ADR-0009 D5: synchronized launch barrier. If M_CPU stamped a
# target_start_ns, wait until then so every PE in this launch
# begins pe_exec measurement at the same simulated time.
target_start = getattr(request, "target_start_ns", None)
if target_start is not None and target_start > env.now:
yield env.timeout(float(target_start) - env.now)
kernel_fn = get_kernel(request.kernel_ref.name)
num_programs = self._derive_num_programs(request)
kernel_args = self._unpack_kernel_args(request)
+36 -3
View File
@@ -186,13 +186,37 @@ class PeDmaComponent(PeEngineBase):
# ── IPCQ inbound (fabric → PE_DMA → MemoryStore + PE_IPCQ) ──────
def _handle_ipcq_inbound(self, env: simpy.Environment, txn: Any) -> Generator:
"""At destination PE_DMA: atomically write data and forward metadata.
"""At destination PE_DMA: pay terminal drain, then atomically write
data and forward metadata.
ADR-0023 D9 (drain at inbound terminal): the Transaction carries
``drain_ns = nbytes / bottleneck_bw_on_path`` stamped by the sender
PE_DMA. Like every other Transaction terminal in the simulator (see
``ComponentBase._forward_txn``), this drain must be paid when the
Transaction reaches its destination. SRC-side ``tl.send`` is
fire-and-forget it never yields on ``sub_done`` so paying the
drain here does NOT delay the sender. What it DOES delay is the
IpcqMetaArrival forwarded below: that delay is the only signal
``tl.recv`` on DST blocks on, which is exactly the desired
semantics "send dispatches and returns; recv waits until the
bytes have actually landed in its inbox".
The drain MUST be paid before the atomic block inserting a yield
inside would break invariant I6.
I6 (MUST): no SimPy yield between MemoryStore.write and the
IpcqMetaArrival put into PE_IPCQ.
"""
from kernbench.common.ipcq_types import IpcqMetaArrival
# Pay terminal BW drain before the atomic write/metadata forward.
# Without this, IPCQ effectively got fabric bandwidth for free at
# the terminal (only intermediate-hop overhead_ns was charged),
# making IPCQ lower than raw DMA at large sizes in benchmarks.
drain = getattr(txn, "drain_ns", 0.0)
if drain > 0:
yield env.timeout(drain)
token = txn.request
# ── ATOMIC: do not introduce yield between these two operations ──
@@ -278,7 +302,16 @@ class PeDmaComponent(PeEngineBase):
dma_res = self._dma_write if is_write else self._dma_read
assert dma_res is not None
pa = PhysAddr.decode(addr)
# Translate VA → PA via MMU (same logic as non-pipeline path)
target_pa = addr
if self._mmu is not None:
from kernbench.policy.address.pe_mmu import PageFault
try:
target_pa = self._mmu.translate(addr)
except PageFault:
target_pa = addr # fallback: treat as PA directly
pa = PhysAddr.decode(target_pa)
dst_node = self.ctx.resolver.resolve(pa)
path = self.ctx.router.find_path(self._pe_prefix, dst_node)
drain_ns = self.ctx.compute_drain_ns(path, nbytes)
@@ -290,7 +323,7 @@ class PeDmaComponent(PeEngineBase):
correlation_id="pipeline",
request_id=f"tile_{token.tile_id}",
src_sip=0, src_cube=0, src_pe=0,
dst_pa=addr, nbytes=nbytes,
dst_pa=target_pa, nbytes=nbytes,
is_write=is_write,
)
sub_txn = Transaction(
+18 -7
View File
@@ -338,9 +338,13 @@ class PeIpcqComponent(ComponentBase):
nbytes=req.result_data.get("nbytes", 0),
)
# Fast path credit return — bottleneck BW based latency
env.process(
self._delayed_credit_send(env, direction, qp["peer_credit_store"], qp["my_tail"])
# Credit return: recv blocks on credit-emit so the protocol cost
# (full path latency to deliver the credit metadata back to the
# sender) is reflected in the recv's pe_exec_ns. Models the IPCQ
# control-plane completing the consume-acknowledgement before
# recv returns to the kernel.
yield from self._delayed_credit_send(
env, direction, qp["peer_credit_store"], qp["my_tail"],
)
if not req.done.triggered:
@@ -455,7 +459,12 @@ class PeIpcqComponent(ComponentBase):
yield peer_credit_store.put(meta)
def _credit_latency_ns(self, direction: str) -> float:
"""Compute credit fast path latency = credit_size / bottleneck_bw.
"""Full path latency for the credit-return packet.
Pays per-node overhead + edge prop + drain along the same fabric
the data took. PathRouter.find_path() auto-appends ".pe_dma" to
the source only, so the destination MUST be spelled with the
explicit ".pe_dma" suffix.
Falls back to 0 when ctx/router is unavailable (unit-test mode).
"""
@@ -463,10 +472,12 @@ class PeIpcqComponent(ComponentBase):
return 0.0
qp = self._queue_pairs[direction]
peer = qp["peer"]
peer_pe_prefix = f"sip{peer.sip}.cube{peer.cube}.pe{peer.pe}"
peer_pe_dma = f"sip{peer.sip}.cube{peer.cube}.pe{peer.pe}.pe_dma"
try:
path = self.ctx.router.find_path(self._pe_prefix, peer_pe_prefix)
return self.ctx.compute_drain_ns(path, self._credit_size_bytes)
path = self.ctx.router.find_path(self._pe_prefix, peer_pe_dma)
return self.ctx.compute_path_latency_ns(
path, self._credit_size_bytes,
)
except Exception:
return 0.0
+19
View File
@@ -26,6 +26,9 @@ class ComponentContext:
spec: dict = field(default_factory=dict) # topology spec (cube layout, PE count, etc.)
memory_store: Any = None # MemoryStore for Phase 1 data-aware execution (ADR-0020)
op_logger: Any = None # OpLogger for Phase 1 op recording (ADR-0020)
# node_id -> overhead_ns (ADR-0009 D5: used by M_CPU to compute per-PE
# dispatch latency when stamping target_start_ns on KernelLaunchMsg).
node_overhead_ns: dict[str, float] = field(default_factory=dict)
def get_shared_resource(
self, env: simpy.Environment, key: str, capacity: int = 1,
@@ -52,3 +55,19 @@ class ComponentContext:
if min_bw == float("inf"):
return 0.0
return nbytes / min_bw
def compute_path_latency_ns(self, path: list[str], nbytes: int = 0) -> float:
"""Formula latency along path: wire + per-node overhead + drain.
ADR-0009 D5: M_CPU uses this to compute per-PE dispatch latency
when stamping target_start_ns on KernelLaunchMsg fanout.
"""
total = 0.0
for i in range(len(path) - 1):
edge = self.edge_map.get((path[i], path[i + 1]))
if edge:
total += edge.distance_mm * self.ns_per_mm
for node_id in path:
total += self.node_overhead_ns.get(node_id, 0.0)
total += self.compute_drain_ns(path, nbytes)
return total
@@ -58,7 +58,13 @@ class IoCpuComponent(ComponentBase):
self._pending[key] = (expected, received, parent_done)
def _dispatch_to_m_cpus(self, env: simpy.Environment, txn: Any) -> Generator:
"""Fan out sub-Transactions to target cube M_CPUs, wait for responses."""
"""Fan out sub-Transactions to target cube M_CPUs, wait for responses.
ADR-0009 D5 (extended): stamp a global target_start_ns on
KernelLaunchMsg so every PE across every target cube starts at
the same env.now. See the non-legacy builtin for full rationale.
"""
import dataclasses
from kernbench.runtime_api.kernel import KernelLaunchMsg, MemoryReadMsg, MemoryWriteMsg
request = txn.request
@@ -72,10 +78,53 @@ class IoCpuComponent(ComponentBase):
txn.done.succeed()
return
if isinstance(request, KernelLaunchMsg):
io_overhead = self.ctx.node_overhead_ns.get(self.node.id, 0.0)
global_max_latency = 0.0
pe_ids = self._resolve_pe_ids(
getattr(request, "target_pe", "all")
)
for sip, cube in cube_targets:
try:
m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
io_to_m_path = self.ctx.router.find_node_path(
self.node.id, m_cpu_id,
)
except Exception:
continue
if len(io_to_m_path) < 2:
continue
leg1 = self.ctx.compute_path_latency_ns(
io_to_m_path, nbytes=0,
)
m_overhead = self.ctx.node_overhead_ns.get(m_cpu_id, 0.0)
for pe_id in pe_ids:
pe_cpu_id = (
f"sip{sip}.cube{cube}.pe{pe_id}.pe_cpu"
)
try:
m_to_pe_path = self.ctx.router.find_node_path(
m_cpu_id, pe_cpu_id,
)
except Exception:
continue
if len(m_to_pe_path) < 2:
continue
leg2 = self.ctx.compute_path_latency_ns(
m_to_pe_path, nbytes=0,
)
latency = leg1 + leg2 - io_overhead - m_overhead
if latency > global_max_latency:
global_max_latency = latency
request = dataclasses.replace(
request,
target_start_ns=float(env.now) + global_max_latency,
)
# Setup aggregation
self._pending[request.request_id] = (len(cube_targets), 0, txn.done)
# Fan out to each target cube's M_CPU
is_kernel_launch = isinstance(request, KernelLaunchMsg)
for sip, cube in cube_targets:
try:
m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
@@ -86,11 +135,24 @@ class IoCpuComponent(ComponentBase):
continue
sub_txn = Transaction(
request=request, path=path, step=0,
nbytes=txn.nbytes, done=env.event(),
nbytes=0 if is_kernel_launch else txn.nbytes,
done=env.event(),
result_data=txn.result_data,
)
yield self.out_ports[path[1]].put(sub_txn.advance())
def _resolve_pe_ids(self, target_pe: Any) -> list[int]:
"""Resolve target_pe → list of PE indices (mirrors M_CPU logic)."""
if isinstance(target_pe, int):
return [target_pe]
if isinstance(target_pe, tuple):
return list(target_pe)
n_slices = 8
if self.ctx and self.ctx.spec:
mm = self.ctx.spec.get("cube", {}).get("memory_map", {})
n_slices = mm.get("hbm_slices_per_cube", 8)
return list(range(n_slices))
def _resolve_cube_targets(self, request: Any) -> list[tuple[int, int]]:
"""Return list of (sip, cube) pairs to fan out to."""
from kernbench.runtime_api.kernel import (
@@ -145,10 +207,10 @@ class IoCpuComponent(ComponentBase):
return []
def _cube_from_pa(self, pa_val: int, fallback: int) -> int:
"""Extract cube_id from a physical address, with fallback."""
"""Extract die_id from a physical address, with fallback."""
from kernbench.policy.address.phyaddr import PhysAddr
try:
return PhysAddr.decode(pa_val).cube_id
return PhysAddr.decode(pa_val).die_id
except Exception:
return fallback
@@ -162,7 +162,11 @@ class MCpuComponent(ComponentBase):
Routes through find_node_path (M_CPU NOC PE_CPU command edges).
PE_CPU sends ResponseMsg back via NOC M_CPU on completion.
Then sends aggregate ResponseMsg back to IO_CPU on the reverse path.
ADR-0009 D5: stamps target_start_ns so every PE in this fanout
starts executing at the same env.now regardless of dispatch path.
"""
import dataclasses
request = txn.request
target_pe = getattr(request, "target_pe", "all")
cube_prefix = self.node.id.rsplit(".", 1)[0] # e.g. "sip0.cube0"
@@ -172,9 +176,10 @@ class MCpuComponent(ComponentBase):
txn.done.succeed()
return
# Fan out to each PE_CPU, using response-based aggregation
sub_txns: list[Transaction] = []
n_dispatched = 0
# Resolve per-PE paths. If IO_CPU already stamped a global
# target_start_ns (ADR-0009 D5 extended), pass it through.
per_pe: list[tuple[int, list[str], float]] = []
max_latency = 0.0
for pe_id in pe_ids:
pe_cpu_id = f"{cube_prefix}.pe{pe_id}.pe_cpu"
try:
@@ -183,8 +188,24 @@ class MCpuComponent(ComponentBase):
continue
if len(path) < 2:
continue
latency = self.ctx.compute_path_latency_ns(path, nbytes=0)
per_pe.append((pe_id, path, latency))
if latency > max_latency:
max_latency = latency
if getattr(request, "target_start_ns", None) is not None:
stamped_request = request
else:
stamped_request = dataclasses.replace(
request, target_start_ns=float(env.now) + max_latency,
)
# Fan out to each PE_CPU, using response-based aggregation
sub_txns: list[Transaction] = []
n_dispatched = 0
for pe_id, path, _lat in per_pe:
sub_txn = Transaction(
request=request, path=path, step=0,
request=stamped_request, path=path, step=0,
nbytes=0, done=env.event(),
)
yield self.out_ports[path[1]].put(sub_txn.advance())
@@ -204,16 +225,21 @@ class MCpuComponent(ComponentBase):
yield all_done
del self._parent_txns[request.request_id]
# Aggregate PE-internal metrics (max across PEs)
# Aggregate PE-internal metrics (max across PEs and across cubes).
# Multiple M_CPUs share the same result_data dict via IO_CPU fanout;
# merge against the existing value so cubes don't clobber each other.
pe_exec_values = [st.result_data.get("pe_exec_ns", 0.0) for st in sub_txns]
if pe_exec_values:
txn.result_data["pe_exec_ns"] = max(pe_exec_values)
cur = txn.result_data.get("pe_exec_ns", 0.0) or 0.0
txn.result_data["pe_exec_ns"] = max(cur, max(pe_exec_values))
dma_values = [st.result_data.get("dma_ns", 0.0) for st in sub_txns]
if dma_values:
txn.result_data["dma_ns"] = max(dma_values)
cur = txn.result_data.get("dma_ns", 0.0) or 0.0
txn.result_data["dma_ns"] = max(cur, max(dma_values))
compute_values = [st.result_data.get("compute_ns", 0.0) for st in sub_txns]
if compute_values:
txn.result_data["compute_ns"] = max(compute_values)
cur = txn.result_data.get("compute_ns", 0.0) or 0.0
txn.result_data["compute_ns"] = max(cur, max(compute_values))
# Send aggregate response on reverse command path back to IO_CPU
reverse_path = list(reversed(txn.path))
@@ -71,6 +71,13 @@ class PeCpuComponent(ComponentBase):
request = txn.request
yield from self.run(env, 0)
# ADR-0009 D5: synchronized launch barrier. If M_CPU stamped a
# target_start_ns, wait until then so every PE in this launch
# begins pe_exec measurement at the same simulated time.
target_start = getattr(request, "target_start_ns", None)
if target_start is not None and target_start > env.now:
yield env.timeout(float(target_start) - env.now)
kernel_fn = get_kernel(request.kernel_ref.name)
num_programs = self._derive_num_programs(request)
kernel_args = self._unpack_kernel_args(request)
+4 -5
View File
@@ -89,11 +89,10 @@ class _FreeList:
class PEMemAllocator:
def __init__(
self, rack_id: int, sip_id: int, cube_id: int, pe_id: int, cfg: AddressConfig,
self, sip_id: int, die_id: int, pe_id: int, cfg: AddressConfig,
) -> None:
self._rack_id = rack_id
self._sip_id = sip_id
self._cube_id = cube_id
self._die_id = die_id
self._pe_id = pe_id
self._cfg = cfg
self._hbm = _FreeList(cfg.hbm_slice_bytes)
@@ -108,7 +107,7 @@ class PEMemAllocator:
f"available {self._cfg.hbm_slice_bytes - self._hbm.used}"
)
return PhysAddr.pe_hbm_addr(
rack_id=self._rack_id, sip_id=self._sip_id, cube_id=self._cube_id,
sip_id=self._sip_id, die_id=self._die_id,
pe_id=self._pe_id, pe_local_hbm_offset=offset,
slice_size_bytes=self._cfg.hbm_slice_bytes,
)
@@ -128,7 +127,7 @@ class PEMemAllocator:
f"available {self._cfg.tcm_allocatable_bytes - self._tcm.used}"
)
return PhysAddr.pe_tcm_addr(
rack_id=self._rack_id, sip_id=self._sip_id, cube_id=self._cube_id,
sip_id=self._sip_id, die_id=self._die_id,
pe_id=self._pe_id, tcm_offset=offset,
)
+70 -13
View File
@@ -19,7 +19,14 @@ class PageFault(Exception):
class PeMMU:
"""Per-PE MMU with page-aligned VA→PA translation table.
"""Per-PE MMU with sub-page-capable VA→PA translation table.
Each page-table entry is a list of (start_in_page, end_in_page,
pa_at_offset_zero) regions. This is a SIMULATOR STOPGAP real MMUs
store one PA per page-table entry. Sub-page regions exist here so
DPPolicy layouts that shard below page granularity (e.g. 128 B
payloads with 4 KB pages) don't silently mis-route through last-
write-wins overwrites. Memory note: project_mmu_subpage_stopgap.md.
Args:
page_size: Page size in bytes (default 2 MB).
@@ -34,7 +41,11 @@ class PeMMU:
self._page_size = page_size
self._page_shift = (page_size - 1).bit_length()
self._page_mask = page_size - 1
self._table: dict[int, int] = {} # va_page_number → pa_page_base
# vpn → list of (start_in_page, end_in_page, pa_at_offset_zero).
# pa_at_offset_zero is the PA that offset 0 of the page would map
# to under this region — i.e. translate(off) = pa_at_offset_zero
# + off when start <= off < end.
self._table: dict[int, list[tuple[int, int, int]]] = {}
self._overhead_ns = overhead_ns
@property
@@ -46,21 +57,67 @@ class PeMMU:
return len(self._table)
def map(self, va: int, pa: int, size: int) -> None:
"""Register VA→PA mapping for a contiguous range."""
for off in range(0, size, self._page_size):
vpn = (va + off) >> self._page_shift
self._table[vpn] = pa + off
"""Register VA→PA mapping for a contiguous range.
Sub-page-aware: a single page can hold multiple disjoint regions,
each pointing to a different PA. Later map() calls APPEND a new
region; on overlap with an existing region, the new region wins
for the overlapping offsets (translate iterates in reverse so the
last write takes precedence matches legacy single-PA behavior
when a full page is re-mapped).
"""
end_va = va + size
cur = va
while cur < end_va:
vpn = cur >> self._page_shift
page_base_va = vpn << self._page_shift
page_end_va = page_base_va + self._page_size
region_start = cur - page_base_va
region_end = min(end_va, page_end_va) - page_base_va
# PA seen at offset 0 of page if this region's mapping covered it
pa_at_offset_zero = pa + (cur - va) - region_start
self._table.setdefault(vpn, []).append(
(region_start, region_end, pa_at_offset_zero)
)
cur = page_base_va + region_end
def unmap(self, va: int, size: int) -> None:
"""Remove VA mapping for a contiguous range."""
for off in range(0, size, self._page_size):
vpn = (va + off) >> self._page_shift
self._table.pop(vpn, None)
"""Remove VA mapping for a contiguous range.
Drops any region whose extent is contained within the unmapped
range. Partial overlaps (region straddles the range boundary)
are left in place caller is expected to unmap on the same
boundaries it mapped on.
"""
end_va = va + size
cur = va
while cur < end_va:
vpn = cur >> self._page_shift
page_base_va = vpn << self._page_shift
page_end_va = page_base_va + self._page_size
unmap_start = cur - page_base_va
unmap_end = min(end_va, page_end_va) - page_base_va
regions = self._table.get(vpn)
if regions is not None:
kept = [
r for r in regions
if not (r[0] >= unmap_start and r[1] <= unmap_end)
]
if kept:
self._table[vpn] = kept
else:
del self._table[vpn]
cur = page_base_va + unmap_end
def translate(self, va: int) -> int:
"""Translate VA to PA. Raises PageFault if unmapped."""
vpn = va >> self._page_shift
pa_page_base = self._table.get(vpn)
if pa_page_base is None:
regions = self._table.get(vpn)
if regions is None:
raise PageFault(va)
return pa_page_base + (va & self._page_mask)
offset = va & self._page_mask
# Iterate latest-first so newer map() calls win on overlap
for start, end, pa_at_offset_zero in reversed(regions):
if start <= offset < end:
return pa_at_offset_zero + offset
raise PageFault(va)
+264 -108
View File
@@ -6,6 +6,47 @@ from typing import Literal
MAX_51 = (1 << 51) - 1
# ── Layout constants (ADR-0001 Rev 2) ────────────────────────────────
# [50:47] sip_id (4)
# [46:42] die_id (5)
# [41: 0] local_offset (42)
_SIP_SHIFT = 47
_DIE_SHIFT = 42
_LOCAL_BITS = 42
_LOCAL_MASK = (1 << _LOCAL_BITS) - 1
# AHBM die: [41:38] MBZ, [37] addr_space, [36:0] sub-address
_AHBM_SEL_BIT = 37
_AHBM_LOCAL_USED = 38 # bits actually meaningful for AHBM
# Resource window: [36:34] resource_kind, [33:0] kind_local
_RES_KIND_SHIFT = 34
_RES_KIND_MASK = 0x7
# PE_LOCAL: [32:29] pe_id, [28:25] pe_sub_unit, [24:0] sub_offset
_PE_ID_SHIFT = 29
_PE_SUB_SHIFT = 25
_PE_SUB_OFFSET_BITS = 25
# MCPU_LOCAL: [29:25] mcpu_sub_unit, [24:0] sub_offset
_MCPU_SUB_SHIFT = 25
# CUBE_SRAM: [24:0] sram_offset
_SRAM_OFFSET_BITS = 25
# IOCHIPLET: [41:40] MBZ, [39:0] chiplet_offset
_CHIPLET_LOCAL_BITS = 40
_IOCPU_BOUNDARY = 1 << 31 # 2 GB
# IOCPU: [30:27] iocpu_sub_unit, [26:0] sub_offset
_IOCPU_SUB_SHIFT = 27
_IOCPU_SUB_OFFSET_BITS = 27
# die_id ranges
_AHBM_DIE_MAX = 15
_CHIPLET_DIE_MIN = 16
_CHIPLET_DIE_MAX = 20
class PhysAddrError(Exception):
pass
@@ -22,163 +63,278 @@ def _chk_max(name: str, v: int, maxv: int) -> None:
class UnitType(IntEnum):
PE = 0
MCPU = 1
SRAM = 2
"""resource_kind values for AHBM resource window."""
PE = 0 # PE_LOCAL
MCPU = 1 # MCPU_LOCAL
SRAM = 2 # CUBE_SRAM
class PESubUnit(IntEnum):
PE_CPU_DTCM = 0
MATH_ENGINE_DTCM = 1
IPCQ = 2
PE_CPU_SFR = 3
MATH_ENGINE_SFR = 4
DMA_ENGINE_SFR = 5
PE_TCM = 6
class MCPUSubUnit(IntEnum):
MCPU_ITCM = 0
MCPU_DTCM = 1
IPCQ = 2
MCPU_SFR = 3
MCPU_DMA_SFR = 4
MCPU_SRAM = 5
class IOCPUSubUnit(IntEnum):
IOCPU_ITCM = 0
IOCPU_DTCM = 1
IPCQ = 2
IOCPU_SFR = 3
IO_DMA_SFR = 4
IO_SRAM = 5
@dataclass(frozen=True)
class PhysAddr:
"""
51-bit physical address value object.
"""51-bit physical address value object (ADR-0001 Rev 2).
Layout:
[50:47] rack_id (4)
[46:43] sip_id (4)
[42:38] sip_seg (5) # cube_id
[37:0] local_offset (38) => each segment is 256GB
local_offset:
[37] selector: 1 = HBM window (128GB reserved), 0 = PE resource window
[50:47] sip_id (4) -- 16 SIPs
[46:42] die_id (5) -- 0..15 AHBM, 16..20 IOCHIPLET
[41: 0] local_offset (42) -- 4 TB per die
"""
rack_id: int
sip_id: int
sip_seg: int
die_id: int
local_offset: int
kind: Literal["hbm", "pe_resource", "raw"] = "raw"
cube_id: int = 0
kind: Literal["hbm", "pe_resource", "iocpu", "ual", "raw"] = "raw"
unit_type: UnitType = UnitType.PE
pe_id: int = 0
ext: int = 0
pe_sub_unit: int = 0
sub_offset: int = 0
hbm_offset: int = 0
iocpu_sub_unit: int = 0
chiplet_offset: int = 0
mcpu_sub_unit: int = 0
HBM_WINDOW_BYTES = 1 << 37 # 128GB
HBM_WINDOW_BYTES = 1 << 37 # 128 GB
# ── encode / decode ──────────────────────────────────────────────
def encode(self) -> int:
_chk_range("rack_id", self.rack_id, 4)
_chk_range("sip_id", self.sip_id, 4)
_chk_range("sip_seg", self.sip_seg, 5)
_chk_range("local_offset", self.local_offset, 38)
addr = (self.rack_id << 47) | (self.sip_id << 43) | (self.sip_seg << 38) | self.local_offset
if not (0 <= addr <= MAX_51):
raise PhysAddrError("address exceeds 51-bit space")
_chk_range("die_id", self.die_id, 5)
_chk_range("local_offset", self.local_offset, _LOCAL_BITS)
# MBZ enforcement
if self.die_id <= _AHBM_DIE_MAX:
mbz_top = (self.local_offset >> _AHBM_LOCAL_USED) & 0xF
if mbz_top != 0:
raise PhysAddrError("AHBM local_offset bits [41:38] must be zero")
elif _CHIPLET_DIE_MIN <= self.die_id <= _CHIPLET_DIE_MAX:
mbz_top = (self.local_offset >> _CHIPLET_LOCAL_BITS) & 0x3
if mbz_top != 0:
raise PhysAddrError("IOCHIPLET local_offset bits [41:40] must be zero")
addr = (self.sip_id << _SIP_SHIFT) | (self.die_id << _DIE_SHIFT) | self.local_offset
return addr
@staticmethod
def decode(addr: int) -> PhysAddr:
if not (0 <= addr <= MAX_51):
raise PhysAddrError("addr must be a 51-bit value")
rack = (addr >> 47) & 0xF
sip_id = (addr >> 43) & 0xF
sip_seg = (addr >> 38) & 0x1F
off = addr & ((1 << 38) - 1)
cube_id = sip_seg
sel = (off >> 37) & 0x1
if sel == 1:
hbm_offset = int(off & ((1 << 37) - 1))
return PhysAddr(
rack_id=rack,
sip_id=sip_id,
sip_seg=sip_seg,
local_offset=off,
kind="hbm",
cube_id=cube_id,
hbm_offset=hbm_offset,
)
# PE resource decode
raw_ut = int((off >> 34) & 0x7)
try:
unit_type = UnitType(raw_ut)
except ValueError:
raise PhysAddrError(f"unknown unit_type: {raw_ut}") from None
pe_id = int((off >> 30) & 0xF)
ext = int((off >> 29) & 0x1)
sub_offset = int(off & ((1 << 29) - 1))
return PhysAddr(
rack_id=rack,
sip_id=sip_id,
sip_seg=sip_seg,
local_offset=off,
kind="pe_resource",
cube_id=cube_id,
unit_type=unit_type,
pe_id=pe_id,
ext=ext,
sub_offset=sub_offset,
hbm_offset=0,
)
sip_id = (addr >> _SIP_SHIFT) & 0xF
die_id = (addr >> _DIE_SHIFT) & 0x1F
local_offset = addr & _LOCAL_MASK
if die_id <= _AHBM_DIE_MAX:
return PhysAddr._decode_ahbm(sip_id, die_id, local_offset)
elif _CHIPLET_DIE_MIN <= die_id <= _CHIPLET_DIE_MAX:
return PhysAddr._decode_chiplet(sip_id, die_id, local_offset)
else:
raise PhysAddrError(f"die_id {die_id} is reserved (21..31)")
@staticmethod
def hbm_addr(*, rack_id: int, sip_id: int, cube_id: int, hbm_offset: int) -> PhysAddr:
_chk_max("cube_id", cube_id, 31)
_chk_range("hbm_offset", hbm_offset, 37)
sip_seg = cube_id
local_offset = (1 << 37) | int(hbm_offset)
def _decode_ahbm(sip_id: int, die_id: int, local_offset: int) -> PhysAddr:
sel = (local_offset >> _AHBM_SEL_BIT) & 0x1
if sel == 1:
hbm_offset = int(local_offset & ((1 << _AHBM_SEL_BIT) - 1))
return PhysAddr(
sip_id=sip_id, die_id=die_id, local_offset=local_offset,
kind="hbm", hbm_offset=hbm_offset,
)
# Resource window
res_kind = int((local_offset >> _RES_KIND_SHIFT) & _RES_KIND_MASK)
try:
unit_type = UnitType(res_kind)
except ValueError:
raise PhysAddrError(f"unknown resource_kind: {res_kind}") from None
if unit_type == UnitType.PE:
pe_id = int((local_offset >> _PE_ID_SHIFT) & 0xF)
pe_sub = int((local_offset >> _PE_SUB_SHIFT) & 0xF)
sub_off = int(local_offset & ((1 << _PE_SUB_OFFSET_BITS) - 1))
return PhysAddr(
sip_id=sip_id, die_id=die_id, local_offset=local_offset,
kind="pe_resource", unit_type=unit_type,
pe_id=pe_id, pe_sub_unit=pe_sub, sub_offset=sub_off,
)
elif unit_type == UnitType.MCPU:
mcpu_sub = int((local_offset >> _MCPU_SUB_SHIFT) & 0x1F)
sub_off = int(local_offset & ((1 << _PE_SUB_OFFSET_BITS) - 1))
return PhysAddr(
sip_id=sip_id, die_id=die_id, local_offset=local_offset,
kind="pe_resource", unit_type=unit_type,
mcpu_sub_unit=mcpu_sub, sub_offset=sub_off,
)
else: # SRAM
sub_off = int(local_offset & ((1 << _SRAM_OFFSET_BITS) - 1))
return PhysAddr(
sip_id=sip_id, die_id=die_id, local_offset=local_offset,
kind="pe_resource", unit_type=unit_type,
sub_offset=sub_off,
)
@staticmethod
def _decode_chiplet(sip_id: int, die_id: int, local_offset: int) -> PhysAddr:
chip_off = local_offset & ((1 << _CHIPLET_LOCAL_BITS) - 1)
if chip_off < _IOCPU_BOUNDARY:
iocpu_sub = int((chip_off >> _IOCPU_SUB_SHIFT) & 0xF)
sub_off = int(chip_off & ((1 << _IOCPU_SUB_OFFSET_BITS) - 1))
return PhysAddr(
sip_id=sip_id, die_id=die_id, local_offset=local_offset,
kind="iocpu", chiplet_offset=chip_off,
iocpu_sub_unit=iocpu_sub, sub_offset=sub_off,
)
else:
return PhysAddr(
sip_id=sip_id, die_id=die_id, local_offset=local_offset,
kind="ual", chiplet_offset=chip_off,
)
# ── AHBM factory methods ────────────────────────────────────────
@staticmethod
def hbm_addr(*, sip_id: int, die_id: int, hbm_offset: int) -> PhysAddr:
_chk_max("die_id", die_id, _AHBM_DIE_MAX)
_chk_range("hbm_offset", hbm_offset, _AHBM_SEL_BIT)
local_offset = (1 << _AHBM_SEL_BIT) | int(hbm_offset)
return PhysAddr(
rack_id=rack_id,
sip_id=sip_id,
sip_seg=sip_seg,
local_offset=local_offset,
kind="hbm",
cube_id=cube_id,
hbm_offset=int(hbm_offset),
sip_id=sip_id, die_id=die_id, local_offset=local_offset,
kind="hbm", hbm_offset=int(hbm_offset),
)
@staticmethod
def pe_hbm_addr(
*,
rack_id: int,
sip_id: int,
cube_id: int,
pe_id: int,
pe_local_hbm_offset: int,
slice_size_bytes: int,
*, sip_id: int, die_id: int,
pe_id: int, pe_local_hbm_offset: int, slice_size_bytes: int,
) -> PhysAddr:
_chk_max("cube_id", cube_id, 31)
_chk_max("die_id", die_id, _AHBM_DIE_MAX)
_chk_range("pe_id", pe_id, 4)
if not (0 <= pe_local_hbm_offset < slice_size_bytes):
raise PhysAddrError("pe_local_hbm_offset out of PE local slice range")
hbm_offset = int(pe_id) * int(slice_size_bytes) + int(pe_local_hbm_offset)
if not (0 <= hbm_offset < PhysAddr.HBM_WINDOW_BYTES):
raise PhysAddrError("HBM offset exceeds reserved 128GB window")
return PhysAddr.hbm_addr(
rack_id=rack_id, sip_id=sip_id, cube_id=cube_id, hbm_offset=hbm_offset
)
return PhysAddr.hbm_addr(sip_id=sip_id, die_id=die_id, hbm_offset=hbm_offset)
@staticmethod
def hbm_pe_id(hbm_offset: int, slice_size_bytes: int) -> int:
return hbm_offset // slice_size_bytes
@staticmethod
def cube_sram_addr(
*, rack_id: int, sip_id: int, cube_id: int, sram_offset: int,
def pe_tcm_addr(
*, sip_id: int, die_id: int, pe_id: int, tcm_offset: int,
) -> PhysAddr:
_chk_max("cube_id", cube_id, 31)
_chk_range("sram_offset", sram_offset, 29)
sip_seg = cube_id
local_offset = (UnitType.SRAM << 34) | sram_offset
return PhysAddr(
rack_id=rack_id, sip_id=sip_id, sip_seg=sip_seg,
local_offset=local_offset,
kind="pe_resource", cube_id=cube_id,
unit_type=UnitType.SRAM, sub_offset=sram_offset,
return PhysAddr.pe_resource_addr(
sip_id=sip_id, die_id=die_id, pe_id=pe_id,
pe_sub_unit=PESubUnit.PE_TCM, sub_offset=tcm_offset,
)
@staticmethod
def pe_tcm_addr(
*, rack_id: int, sip_id: int, cube_id: int, pe_id: int, tcm_offset: int,
def pe_resource_addr(
*, sip_id: int, die_id: int, pe_id: int,
pe_sub_unit: int, sub_offset: int,
) -> PhysAddr:
_chk_max("cube_id", cube_id, 31)
_chk_max("die_id", die_id, _AHBM_DIE_MAX)
_chk_range("pe_id", pe_id, 4)
_chk_range("tcm_offset", tcm_offset, 29)
sip_seg = cube_id
local_offset = (UnitType.PE << 34) | (pe_id << 30) | tcm_offset
return PhysAddr(
rack_id=rack_id, sip_id=sip_id, sip_seg=sip_seg,
local_offset=local_offset,
kind="pe_resource", cube_id=cube_id,
unit_type=UnitType.PE, pe_id=pe_id, sub_offset=tcm_offset,
_chk_range("pe_sub_unit", pe_sub_unit, 4)
_chk_range("sub_offset", sub_offset, _PE_SUB_OFFSET_BITS)
local_offset = (
(UnitType.PE << _RES_KIND_SHIFT)
| (pe_id << _PE_ID_SHIFT)
| (pe_sub_unit << _PE_SUB_SHIFT)
| sub_offset
)
return PhysAddr(
sip_id=sip_id, die_id=die_id, local_offset=local_offset,
kind="pe_resource", unit_type=UnitType.PE,
pe_id=pe_id, pe_sub_unit=pe_sub_unit, sub_offset=sub_offset,
)
@staticmethod
def cube_sram_addr(
*, sip_id: int, die_id: int, sram_offset: int,
) -> PhysAddr:
_chk_max("die_id", die_id, _AHBM_DIE_MAX)
_chk_range("sram_offset", sram_offset, _SRAM_OFFSET_BITS)
local_offset = (UnitType.SRAM << _RES_KIND_SHIFT) | sram_offset
return PhysAddr(
sip_id=sip_id, die_id=die_id, local_offset=local_offset,
kind="pe_resource", unit_type=UnitType.SRAM, sub_offset=sram_offset,
)
@staticmethod
def mcpu_resource_addr(
*, sip_id: int, die_id: int, mcpu_sub_unit: int, sub_offset: int,
) -> PhysAddr:
_chk_max("die_id", die_id, _AHBM_DIE_MAX)
_chk_range("mcpu_sub_unit", mcpu_sub_unit, 5)
_chk_range("sub_offset", sub_offset, _PE_SUB_OFFSET_BITS)
local_offset = (
(UnitType.MCPU << _RES_KIND_SHIFT)
| (mcpu_sub_unit << _MCPU_SUB_SHIFT)
| sub_offset
)
return PhysAddr(
sip_id=sip_id, die_id=die_id, local_offset=local_offset,
kind="pe_resource", unit_type=UnitType.MCPU,
mcpu_sub_unit=mcpu_sub_unit, sub_offset=sub_offset,
)
# ── IOCHIPLET factory methods ────────────────────────────────────
@staticmethod
def iocpu_resource_addr(
*, sip_id: int, die_id: int, iocpu_sub_unit: int, sub_offset: int,
) -> PhysAddr:
_chk_max("die_id", die_id, _CHIPLET_DIE_MAX)
if die_id < _CHIPLET_DIE_MIN:
raise PhysAddrError(
f"die_id {die_id} is not an IOCHIPLET "
f"(must be {_CHIPLET_DIE_MIN}..{_CHIPLET_DIE_MAX})"
)
_chk_range("iocpu_sub_unit", iocpu_sub_unit, 4)
_chk_range("sub_offset", sub_offset, _IOCPU_SUB_OFFSET_BITS)
chiplet_offset = (iocpu_sub_unit << _IOCPU_SUB_SHIFT) | sub_offset
if chiplet_offset >= _IOCPU_BOUNDARY:
raise PhysAddrError("IOCPU region overflow (must be < 2 GB)")
return PhysAddr(
sip_id=sip_id, die_id=die_id, local_offset=chiplet_offset,
kind="iocpu", chiplet_offset=chiplet_offset,
iocpu_sub_unit=iocpu_sub_unit, sub_offset=sub_offset,
)
@staticmethod
def ual_addr(*, sip_id: int, die_id: int, ual_offset: int) -> PhysAddr:
_chk_max("die_id", die_id, _CHIPLET_DIE_MAX)
if die_id < _CHIPLET_DIE_MIN:
raise PhysAddrError(f"die_id {die_id} is not an IOCHIPLET")
chiplet_offset = _IOCPU_BOUNDARY + ual_offset
_chk_range("chiplet_offset", chiplet_offset, _CHIPLET_LOCAL_BITS)
return PhysAddr(
sip_id=sip_id, die_id=die_id, local_offset=chiplet_offset,
kind="ual", chiplet_offset=chiplet_offset,
)
+5 -5
View File
@@ -27,16 +27,16 @@ class AddressResolver:
def resolve(self, addr: PhysAddr) -> str:
s = addr.sip_id
c = addr.cube_id
d = addr.die_id
if addr.kind == "hbm":
node_id = f"sip{s}.cube{c}.hbm_ctrl"
node_id = f"sip{s}.cube{d}.hbm_ctrl"
elif addr.kind == "pe_resource":
if addr.unit_type == UnitType.PE:
node_id = f"sip{s}.cube{c}.pe{addr.pe_id}.pe_tcm"
node_id = f"sip{s}.cube{d}.pe{addr.pe_id}.pe_tcm"
elif addr.unit_type == UnitType.SRAM:
node_id = f"sip{s}.cube{c}.sram"
node_id = f"sip{s}.cube{d}.sram"
elif addr.unit_type == UnitType.MCPU:
node_id = f"sip{s}.cube{c}.m_cpu"
node_id = f"sip{s}.cube{d}.m_cpu"
else:
raise RoutingError(f"unsupported unit_type: {addr.unit_type}")
else:
+1 -1
View File
@@ -385,7 +385,7 @@ class RuntimeContext:
for cube_id in range(cubes_per_sip):
for pe_id in range(pes_per_cube):
self._allocators[(sip_id, cube_id, pe_id)] = PEMemAllocator(
rack_id=0, sip_id=sip_id, cube_id=cube_id, pe_id=pe_id, cfg=cfg,
sip_id=sip_id, die_id=cube_id, pe_id=pe_id, cfg=cfg,
)
# Initialize VA allocator (MMU mappings are installed via fabric MmuMapMsg)
+12 -1
View File
@@ -113,7 +113,18 @@ class AhbmCCLBackend:
)
n_elem = shards[0].nbytes // tensor.itemsize
kernel_fn = self._algo_module.kernel
kernel_args = self._algo_module.kernel_args(self._world_size, n_elem)
# Derive effective cube dims from tensor's actual shard placement
# (may differ from topology mesh when TP uses fewer cubes).
sip0_cubes = sorted({s.cube for s in shards if s.sip == shards[0].sip})
eff_n_cubes = len(sip0_cubes) if sip0_cubes else 1
if eff_n_cubes == 1:
eff_cube_w, eff_cube_h = 1, 1
else:
eff_cube_w, eff_cube_h = self._cube_w, self._cube_h
kernel_args = self._algo_module.kernel_args(
self._world_size, n_elem,
cube_w=eff_cube_w, cube_h=eff_cube_h,
)
# Resolve sip_rank from the current greenlet's bound rank
from greenlet import getcurrent as _gc
+5
View File
@@ -90,6 +90,11 @@ class KernelLaunchMsg:
args: tuple[KernelArg, ...]
target_cubes: tuple[int, ...] | Literal["all"] = "all"
target_pe: int | tuple[int, ...] | Literal["all"] = "all"
# ADR-0009 D5: synchronized kernel start. When set, each PE_CPU yields
# until env.now >= target_start_ns before beginning kernel execution,
# so every PE in a launch starts at the same simulated time regardless
# of its M_CPU dispatch path length. Stamped by M_CPU fan-out.
target_start_ns: float | None = None
msg_type: Literal["kernel_launch"] = "kernel_launch"
+4
View File
@@ -67,6 +67,10 @@ class GraphEngine:
spec=graph.spec,
memory_store=self._memory_store,
op_logger=self._op_logger,
node_overhead_ns={
nid: float(n.attrs.get("overhead_ns", 0.0))
for nid, n in graph.nodes.items()
},
)
self._components: dict[str, ComponentBase] = {
node_id: ComponentRegistry.create(node, overrides, ctx)
+3 -3
View File
@@ -212,7 +212,7 @@ def _generate_probe_h2d(graph, edge_map) -> list[dict]:
t_offset = 0.0
for rid, (name, cube, hops) in enumerate(cases):
pa = PhysAddr.pe_hbm_addr(
rack_id=0, sip_id=0, cube_id=cube, pe_id=0,
sip_id=0, die_id=cube, pe_id=0,
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
)
dst_node = resolver.resolve(pa)
@@ -256,7 +256,7 @@ def _generate_probe_d2h(graph, edge_map) -> list[dict]:
t_offset = 0.0
for rid, (name, cube, hops) in enumerate(cases):
pa = PhysAddr.pe_hbm_addr(
rack_id=0, sip_id=0, cube_id=cube, pe_id=0,
sip_id=0, die_id=cube, pe_id=0,
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
)
dst_node = resolver.resolve(pa)
@@ -310,7 +310,7 @@ def _generate_probe_pe_dma(graph, edge_map) -> list[dict]:
t_offset = 0.0
for rid, (name, sip, src_cube, src_pe, dst_cube, dst_pe) in enumerate(cases):
pa = PhysAddr.pe_hbm_addr(
rack_id=0, sip_id=sip, cube_id=dst_cube, pe_id=dst_pe,
sip_id=sip, die_id=dst_cube, pe_id=dst_pe,
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
)
dst_node = resolver.resolve(pa)
Binary file not shown.

After

Width:  |  Height:  |  Size: 41 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 87 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 39 KiB

+37
View File
@@ -0,0 +1,37 @@
algorithm,sip_topology,n_sips,n_elem,bytes_per_pe,bytes_per_sip,latency_ns
intercube_allreduce,mesh_2d_no_wrap,6,8,16,256,3508.4249999999993
intercube_allreduce,mesh_2d_no_wrap,6,32,64,1024,3515.55
intercube_allreduce,mesh_2d_no_wrap,6,64,128,2048,3525.0499999999975
intercube_allreduce,mesh_2d_no_wrap,6,128,256,4096,3544.049999999992
intercube_allreduce,mesh_2d_no_wrap,6,512,1024,16384,3667.049999999992
intercube_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3837.049999999992
intercube_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,4177.049999999992
intercube_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,4857.049999999959
intercube_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,6217.049999999945
intercube_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,8937.049999999937
intercube_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,14377.049999999872
intercube_allreduce,mesh_2d_no_wrap,6,49152,98304,1572864,19817.049999999872
intercube_allreduce,ring_1d,6,8,16,256,3073.1299999999937
intercube_allreduce,ring_1d,6,32,64,1024,3079.8799999999947
intercube_allreduce,ring_1d,6,64,128,2048,3088.879999999992
intercube_allreduce,ring_1d,6,128,256,4096,3106.8799999999865
intercube_allreduce,ring_1d,6,512,1024,16384,3225.8799999999865
intercube_allreduce,ring_1d,6,1024,2048,32768,3391.8799999999865
intercube_allreduce,ring_1d,6,2048,4096,65536,3723.8799999999865
intercube_allreduce,ring_1d,6,4096,8192,131072,4387.879999999965
intercube_allreduce,ring_1d,6,8192,16384,262144,5715.879999999957
intercube_allreduce,ring_1d,6,16384,32768,524288,8371.879999999932
intercube_allreduce,ring_1d,6,32768,65536,1048576,13683.879999999903
intercube_allreduce,ring_1d,6,49152,98304,1572864,18995.879999999917
intercube_allreduce,torus_2d,6,8,16,256,2190.4799999999923
intercube_allreduce,torus_2d,6,32,64,1024,2196.479999999993
intercube_allreduce,torus_2d,6,64,128,2048,2204.4799999999905
intercube_allreduce,torus_2d,6,128,256,4096,2220.479999999985
intercube_allreduce,torus_2d,6,512,1024,16384,2325.479999999985
intercube_allreduce,torus_2d,6,1024,2048,32768,2471.479999999985
intercube_allreduce,torus_2d,6,2048,4096,65536,2763.479999999985
intercube_allreduce,torus_2d,6,4096,8192,131072,3347.4799999999777
intercube_allreduce,torus_2d,6,8192,16384,262144,4515.4799999999705
intercube_allreduce,torus_2d,6,16384,32768,524288,6851.479999999952
intercube_allreduce,torus_2d,6,32768,65536,1048576,11523.479999999923
intercube_allreduce,torus_2d,6,49152,98304,1572864,16195.479999999952
1 algorithm sip_topology n_sips n_elem bytes_per_pe bytes_per_sip latency_ns
2 intercube_allreduce mesh_2d_no_wrap 6 8 16 256 3508.4249999999993
3 intercube_allreduce mesh_2d_no_wrap 6 32 64 1024 3515.55
4 intercube_allreduce mesh_2d_no_wrap 6 64 128 2048 3525.0499999999975
5 intercube_allreduce mesh_2d_no_wrap 6 128 256 4096 3544.049999999992
6 intercube_allreduce mesh_2d_no_wrap 6 512 1024 16384 3667.049999999992
7 intercube_allreduce mesh_2d_no_wrap 6 1024 2048 32768 3837.049999999992
8 intercube_allreduce mesh_2d_no_wrap 6 2048 4096 65536 4177.049999999992
9 intercube_allreduce mesh_2d_no_wrap 6 4096 8192 131072 4857.049999999959
10 intercube_allreduce mesh_2d_no_wrap 6 8192 16384 262144 6217.049999999945
11 intercube_allreduce mesh_2d_no_wrap 6 16384 32768 524288 8937.049999999937
12 intercube_allreduce mesh_2d_no_wrap 6 32768 65536 1048576 14377.049999999872
13 intercube_allreduce mesh_2d_no_wrap 6 49152 98304 1572864 19817.049999999872
14 intercube_allreduce ring_1d 6 8 16 256 3073.1299999999937
15 intercube_allreduce ring_1d 6 32 64 1024 3079.8799999999947
16 intercube_allreduce ring_1d 6 64 128 2048 3088.879999999992
17 intercube_allreduce ring_1d 6 128 256 4096 3106.8799999999865
18 intercube_allreduce ring_1d 6 512 1024 16384 3225.8799999999865
19 intercube_allreduce ring_1d 6 1024 2048 32768 3391.8799999999865
20 intercube_allreduce ring_1d 6 2048 4096 65536 3723.8799999999865
21 intercube_allreduce ring_1d 6 4096 8192 131072 4387.879999999965
22 intercube_allreduce ring_1d 6 8192 16384 262144 5715.879999999957
23 intercube_allreduce ring_1d 6 16384 32768 524288 8371.879999999932
24 intercube_allreduce ring_1d 6 32768 65536 1048576 13683.879999999903
25 intercube_allreduce ring_1d 6 49152 98304 1572864 18995.879999999917
26 intercube_allreduce torus_2d 6 8 16 256 2190.4799999999923
27 intercube_allreduce torus_2d 6 32 64 1024 2196.479999999993
28 intercube_allreduce torus_2d 6 64 128 2048 2204.4799999999905
29 intercube_allreduce torus_2d 6 128 256 4096 2220.479999999985
30 intercube_allreduce torus_2d 6 512 1024 16384 2325.479999999985
31 intercube_allreduce torus_2d 6 1024 2048 32768 2471.479999999985
32 intercube_allreduce torus_2d 6 2048 4096 65536 2763.479999999985
33 intercube_allreduce torus_2d 6 4096 8192 131072 3347.4799999999777
34 intercube_allreduce torus_2d 6 8192 16384 262144 4515.4799999999705
35 intercube_allreduce torus_2d 6 16384 32768 524288 6851.479999999952
36 intercube_allreduce torus_2d 6 32768 65536 1048576 11523.479999999923
37 intercube_allreduce torus_2d 6 49152 98304 1572864 16195.479999999952
Binary file not shown.

After

Width:  |  Height:  |  Size: 194 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 39 KiB

+34
View File
@@ -7,11 +7,45 @@ stateful/SimPy-event-consuming and MUST NOT be shared).
"""
from __future__ import annotations
import os
import pytest
from kernbench.topology.builder import resolve_topology
def pytest_sessionfinish(session, exitstatus):
"""Aggregate parametrized sweep rows into combined CSV + PNG plots.
Runs on the controller node only (xdist worker processes set
``PYTEST_XDIST_WORKER``; we skip those). Idempotent does nothing
if no sweep rows are present (e.g., when the sweep was filtered out).
"""
if os.environ.get("PYTEST_XDIST_WORKER"):
return
import importlib.util
import sys
from pathlib import Path
mod_path = Path(__file__).parent / "test_allreduce_multidevice.py"
if not mod_path.exists():
return
spec = importlib.util.spec_from_file_location(
"_test_allreduce_multidevice_for_aggregate", mod_path,
)
if spec is None or spec.loader is None:
return
mod = importlib.util.module_from_spec(spec)
sys.modules[spec.name] = mod
try:
spec.loader.exec_module(mod)
agg = getattr(mod, "_aggregate_sweep_plots", None)
if agg is not None:
agg()
except Exception as e:
print(f"[conftest] sweep aggregation failed: {e}")
@pytest.fixture(scope="session")
def topology():
"""Session-scoped parsed topology (immutable graph + spec).
Binary file not shown.

After

Width:  |  Height:  |  Size: 48 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 48 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 51 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 50 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 100 KiB

+81
View File
@@ -0,0 +1,81 @@
hop,label,size_bytes,path,total_ns
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,ipcq,31.1399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,raw,12.019999999996799
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,ipcq,32.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,raw,13.019999999996799
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,ipcq,34.1399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,raw,14.019999999996799
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,ipcq,35.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,raw,15.019999999996799
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,ipcq,38.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,raw,17.0199999999968
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,ipcq,41.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,raw,19.0199999999968
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,ipcq,53.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,raw,27.0199999999968
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,ipcq,77.6399999999976
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,raw,43.0199999999968
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,ipcq,125.64000000000306
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,raw,75.02000000000407
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,ipcq,149.64000000000306
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,raw,91.02000000000407
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,ipcq,31.1399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,raw,12.019999999996799
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,ipcq,32.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,raw,13.019999999996799
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,ipcq,34.1399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,raw,14.019999999996799
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,ipcq,35.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,raw,15.019999999996799
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,ipcq,38.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,raw,17.0199999999968
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,ipcq,41.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,raw,19.0199999999968
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,ipcq,53.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,raw,27.0199999999968
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,ipcq,77.6399999999976
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,raw,43.0199999999968
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,ipcq,125.64000000000306
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,raw,75.02000000000407
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,ipcq,149.64000000000306
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,raw,91.02000000000407
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,ipcq,67.15999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,raw,68.53999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,ipcq,68.65999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,raw,70.03999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,ipcq,70.15999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,raw,71.53999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,ipcq,71.65999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,raw,73.03999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,ipcq,74.65999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,raw,76.03999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,ipcq,77.65999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,raw,79.03999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,ipcq,89.65999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,raw,91.03999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,ipcq,113.65999999999804
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,raw,115.03999999999724
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,ipcq,161.65999999999985
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,raw,163.04000000000087
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,ipcq,185.65999999999985
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,raw,187.04000000000087
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,ipcq,87.15999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,raw,88.53999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,ipcq,88.65999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,raw,90.03999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,ipcq,90.15999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,raw,91.53999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,ipcq,91.65999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,raw,93.03999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,ipcq,94.65999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,raw,96.03999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,ipcq,97.65999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,raw,99.03999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,ipcq,109.65999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,raw,111.03999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,ipcq,133.65999999999804
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,raw,135.03999999999724
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,ipcq,181.65999999999985
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,raw,183.04000000000087
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,ipcq,205.65999999999985
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,raw,207.04000000000087
1 hop label size_bytes path total_ns
2 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 128 ipcq 31.1399999999976
3 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 128 raw 12.019999999996799
4 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 256 ipcq 32.6399999999976
5 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 256 raw 13.019999999996799
6 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 384 ipcq 34.1399999999976
7 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 384 raw 14.019999999996799
8 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 512 ipcq 35.6399999999976
9 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 512 raw 15.019999999996799
10 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 768 ipcq 38.6399999999976
11 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 768 raw 17.0199999999968
12 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 1024 ipcq 41.6399999999976
13 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 1024 raw 19.0199999999968
14 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 2048 ipcq 53.6399999999976
15 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 2048 raw 27.0199999999968
16 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 4096 ipcq 77.6399999999976
17 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 4096 raw 43.0199999999968
18 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 8192 ipcq 125.64000000000306
19 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 8192 raw 75.02000000000407
20 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 10240 ipcq 149.64000000000306
21 h1_intra_horizontal Intra-cube horizontal (pe0 to pe1) 10240 raw 91.02000000000407
22 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 128 ipcq 31.1399999999976
23 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 128 raw 12.019999999996799
24 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 256 ipcq 32.6399999999976
25 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 256 raw 13.019999999996799
26 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 384 ipcq 34.1399999999976
27 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 384 raw 14.019999999996799
28 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 512 ipcq 35.6399999999976
29 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 512 raw 15.019999999996799
30 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 768 ipcq 38.6399999999976
31 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 768 raw 17.0199999999968
32 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 1024 ipcq 41.6399999999976
33 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 1024 raw 19.0199999999968
34 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 2048 ipcq 53.6399999999976
35 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 2048 raw 27.0199999999968
36 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 4096 ipcq 77.6399999999976
37 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 4096 raw 43.0199999999968
38 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 8192 ipcq 125.64000000000306
39 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 8192 raw 75.02000000000407
40 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 10240 ipcq 149.64000000000306
41 h2_intra_vertical Intra-cube vertical (pe0 to pe4) 10240 raw 91.02000000000407
42 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 128 ipcq 67.15999999999804
43 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 128 raw 68.53999999999724
44 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 256 ipcq 68.65999999999804
45 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 256 raw 70.03999999999724
46 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 384 ipcq 70.15999999999804
47 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 384 raw 71.53999999999724
48 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 512 ipcq 71.65999999999804
49 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 512 raw 73.03999999999724
50 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 768 ipcq 74.65999999999804
51 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 768 raw 76.03999999999724
52 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 1024 ipcq 77.65999999999804
53 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 1024 raw 79.03999999999724
54 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 2048 ipcq 89.65999999999804
55 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 2048 raw 91.03999999999724
56 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 4096 ipcq 113.65999999999804
57 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 4096 raw 115.03999999999724
58 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 8192 ipcq 161.65999999999985
59 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 8192 raw 163.04000000000087
60 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 10240 ipcq 185.65999999999985
61 h3_inter_cube_horizontal Inter-cube horizontal (cube0 to cube1) 10240 raw 187.04000000000087
62 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 128 ipcq 87.15999999999804
63 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 128 raw 88.53999999999724
64 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 256 ipcq 88.65999999999804
65 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 256 raw 90.03999999999724
66 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 384 ipcq 90.15999999999804
67 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 384 raw 91.53999999999724
68 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 512 ipcq 91.65999999999804
69 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 512 raw 93.03999999999724
70 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 768 ipcq 94.65999999999804
71 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 768 raw 96.03999999999724
72 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 1024 ipcq 97.65999999999804
73 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 1024 raw 99.03999999999724
74 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 2048 ipcq 109.65999999999804
75 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 2048 raw 111.03999999999724
76 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 4096 ipcq 133.65999999999804
77 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 4096 raw 135.03999999999724
78 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 8192 ipcq 181.65999999999985
79 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 8192 raw 183.04000000000087
80 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 10240 ipcq 205.65999999999985
81 h4_inter_cube_vertical Inter-cube vertical (cube0 to cube4) 10240 raw 207.04000000000087
+1 -1
View File
@@ -149,7 +149,7 @@ def _make_tuple_allocators(
) -> dict[tuple[int, int, int], PEMemAllocator]:
return {
(s, c, p): PEMemAllocator(
rack_id=0, sip_id=s, cube_id=c, pe_id=p, cfg=_CFG,
sip_id=s, die_id=c, pe_id=p, cfg=_CFG,
)
for s in range(num_sips)
for c in range(num_cubes)
+641 -13
View File
@@ -22,13 +22,23 @@ from kernbench.ccl.sfr_config import configure_sfr_intercube_multisip
from kernbench.policy.placement.dp import DPPolicy
def _sip_topo_dims(sip_topo: str, n_sips: int) -> tuple[int, int]:
def _sip_topo_dims(
sip_topo: str, n_sips: int,
spec_w: int | None = None, spec_h: int | None = None,
) -> tuple[int, int]:
if sip_topo == "ring_1d":
return (0, 0)
if spec_w is not None and spec_h is not None:
if spec_w * spec_h != n_sips:
raise ValueError(
f"sip layout {spec_w}x{spec_h} != n_sips ({n_sips})"
)
return (spec_w, spec_h)
side = int(round(math.sqrt(n_sips)))
if side * side != n_sips:
raise ValueError(
f"SIP topology '{sip_topo}' requires square n_sips, got {n_sips}"
f"SIP topology '{sip_topo}' requires square n_sips or "
f"explicit w/h in spec, got {n_sips}"
)
return (side, side)
@@ -54,10 +64,13 @@ def run_allreduce(
topo_name_to_kind = algo_module.TOPO_NAME_TO_KIND
n_elem = int(cfg.get("n_elem", 8))
n_sips = int(spec.get("system", {}).get("sips", {}).get("count", 1))
sip_topo = str(
spec.get("system", {}).get("sips", {}).get("topology", "ring_1d")
)
sips_cfg = spec.get("system", {}).get("sips", {})
n_sips = int(sips_cfg.get("count", 1))
sip_topo = str(sips_cfg.get("topology", "ring_1d"))
spec_sip_w = sips_cfg.get("w")
spec_sip_h = sips_cfg.get("h")
spec_sip_w = int(spec_sip_w) if spec_sip_w is not None else None
spec_sip_h = int(spec_sip_h) if spec_sip_h is not None else None
cm = spec["sip"]["cube_mesh"]
cube_w = int(cm["w"])
@@ -65,7 +78,9 @@ def run_allreduce(
n_cubes = cube_w * cube_h
sip_topo_kind = topo_name_to_kind.get(sip_topo, 0)
sip_topo_w, sip_topo_h = _sip_topo_dims(sip_topo, n_sips)
sip_topo_w, sip_topo_h = _sip_topo_dims(
sip_topo, n_sips, spec_w=spec_sip_w, spec_h=spec_sip_h,
)
algo_name = cfg.get("algorithm", "allreduce")
print(f"\n{'=' * 60}")
@@ -173,18 +188,36 @@ from kernbench.topology.builder import resolve_topology
TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
CONFIGS = [
pytest.param("intercube_allreduce", "ring_1d", 2, id="ring_2sip"),
pytest.param("intercube_allreduce", "torus_2d", 4, id="torus_4sip"),
pytest.param("intercube_allreduce", "mesh_2d_no_wrap", 4, id="mesh_4sip"),
pytest.param(
"intercube_allreduce", "ring_1d", 6, None, None,
id="ring_6sip",
),
pytest.param(
"intercube_allreduce", "torus_2d", 6, 2, 3,
id="torus_6sip_2x3",
),
pytest.param(
"intercube_allreduce", "mesh_2d_no_wrap", 6, 2, 3,
id="mesh_6sip_2x3",
),
]
def _write_temp_configs(tmp_path, sip_topology, n_sips, algorithm):
def _write_temp_configs(
tmp_path, sip_topology, n_sips, algorithm, n_elem_override=None,
sip_w=None, sip_h=None,
):
"""Write temp topology.yaml and ccl.yaml with the given overrides."""
with open(TOPOLOGY_PATH) as f:
topo_cfg = yaml.safe_load(f)
topo_cfg["system"]["sips"]["count"] = n_sips
topo_cfg["system"]["sips"]["topology"] = sip_topology
if sip_w is not None and sip_h is not None:
topo_cfg["system"]["sips"]["w"] = int(sip_w)
topo_cfg["system"]["sips"]["h"] = int(sip_h)
else:
topo_cfg["system"]["sips"].pop("w", None)
topo_cfg["system"]["sips"].pop("h", None)
topo_path = tmp_path / "topology.yaml"
with open(topo_path, "w") as f:
yaml.dump(topo_cfg, f, default_flow_style=False)
@@ -193,6 +226,15 @@ def _write_temp_configs(tmp_path, sip_topology, n_sips, algorithm):
with open(ccl_path) as f:
ccl_cfg = yaml.safe_load(f)
ccl_cfg["defaults"]["algorithm"] = algorithm
if n_elem_override is not None:
ccl_cfg.setdefault("algorithms", {}).setdefault(
algorithm, {},
)["n_elem"] = int(n_elem_override)
# Ensure IPCQ slot is big enough for the per-message payload.
per_msg_bytes = int(n_elem_override) * 2 # f16
default_slot = int(ccl_cfg["defaults"].get("slot_size", 4096))
if per_msg_bytes > default_slot:
ccl_cfg["defaults"]["slot_size"] = per_msg_bytes
tmp_ccl = tmp_path / "ccl.yaml"
with open(tmp_ccl, "w") as f:
yaml.dump(ccl_cfg, f, default_flow_style=False)
@@ -200,10 +242,15 @@ def _write_temp_configs(tmp_path, sip_topology, n_sips, algorithm):
return str(topo_path), str(tmp_ccl)
@pytest.mark.parametrize("algorithm,sip_topology,n_sips", CONFIGS)
def test_allreduce(tmp_path, algorithm, sip_topology, n_sips):
@pytest.mark.parametrize(
"algorithm,sip_topology,n_sips,sip_w,sip_h", CONFIGS,
)
def test_allreduce(
tmp_path, algorithm, sip_topology, n_sips, sip_w, sip_h,
):
topo_path, ccl_path = _write_temp_configs(
tmp_path, sip_topology, n_sips, algorithm,
sip_w=sip_w, sip_h=sip_h,
)
topo = resolve_topology(topo_path)
engine = GraphEngine(topo.topology_obj, enable_data=True)
@@ -220,3 +267,584 @@ def test_allreduce(tmp_path, algorithm, sip_topology, n_sips):
algorithm=algorithm, ccl_yaml=ccl_path,
)
assert result["ok_cubes"] > 0
# ── Latency sweep (parametrized + xdist-friendly) ─────────────────────
# avoid 16 (== n_cubes, dim_map collision). Goes up to 96 KB per PE:
# bytes_per_pe = n_elem * 2 (f16). 49152 elem * 2 = 96 KB / PE.
_SWEEP_N_ELEM = [
8, 32, 64, 128, 512, 1024, 2048,
4096, 8192, 16384, 32768, 49152,
]
_ELEM_BYTES_F16 = 2
_SWEEP_TOPOLOGIES = [
("intercube_allreduce", "ring_1d", 6, None, None),
("intercube_allreduce", "torus_2d", 6, 2, 3),
("intercube_allreduce", "mesh_2d_no_wrap", 6, 2, 3),
]
# Shared on-disk staging dir for parametrized sweep rows. Each
# parametrized invocation writes one JSON file here; the aggregator
# (run from conftest.pytest_sessionfinish) reads them and emits the
# combined CSV + PNG plots.
_SWEEP_OUT_DIR = Path(__file__).parent / "allreduce_latency_plots"
_SWEEP_ROWS_DIR = _SWEEP_OUT_DIR / "_rows"
def _sweep_params():
out = []
for algorithm, sip_topology, n_sips, sip_w, sip_h in _SWEEP_TOPOLOGIES:
for n_elem in _SWEEP_N_ELEM:
out.append(pytest.param(
algorithm, sip_topology, n_sips, sip_w, sip_h, n_elem,
id=f"{sip_topology}-n_elem{n_elem}",
))
return out
@pytest.mark.parametrize(
"algorithm,sip_topology,n_sips,sip_w,sip_h,n_elem", _sweep_params(),
)
def test_allreduce_latency_one(
tmp_path, algorithm, sip_topology, n_sips, sip_w, sip_h, n_elem,
):
"""One config of the latency sweep. xdist parallelizes across params.
Writes a single JSON row to ``_SWEEP_ROWS_DIR``. The conftest
sessionfinish hook aggregates rows into CSV + plots after all
parametrized cases finish.
"""
import json
topo_path, ccl_path = _write_temp_configs(
tmp_path, sip_topology, n_sips, algorithm,
sip_w=sip_w, sip_h=sip_h,
n_elem_override=n_elem,
)
topo = resolve_topology(topo_path)
engine = GraphEngine(topo.topology_obj, enable_data=True)
spec = topo.topology_obj.spec
with RuntimeContext(
engine=engine,
target_device=DeviceSelector("all"),
correlation_id=f"sweep_{algorithm}_{sip_topology}_{n_elem}",
spec=spec,
) as ctx:
result = run_allreduce(
ctx, engine, spec,
algorithm=algorithm, ccl_yaml=ccl_path,
)
assert result["ok_cubes"] > 0
pe_exec_vals = [
float(tr.get("pe_exec_ns", 0.0) or 0.0)
for _, (_, tr) in engine._results.items()
if isinstance(tr, dict)
]
crit_ns = max(pe_exec_vals) if pe_exec_vals else 0.0
cm = spec["sip"]["cube_mesh"]
n_cubes = int(cm["w"]) * int(cm["h"])
bytes_per_sip = n_cubes * n_elem * _ELEM_BYTES_F16
bytes_per_pe = n_elem * _ELEM_BYTES_F16
record = {
"algorithm": algorithm,
"sip_topology": sip_topology,
"n_sips": n_sips,
"n_elem": n_elem,
"bytes_per_pe": bytes_per_pe,
"bytes_per_sip": bytes_per_sip,
"latency_ns": crit_ns,
}
_SWEEP_ROWS_DIR.mkdir(parents=True, exist_ok=True)
row_path = _SWEEP_ROWS_DIR / f"{sip_topology}_{n_elem}.json"
with open(row_path, "w", encoding="utf-8") as f:
json.dump(record, f)
def _aggregate_sweep_plots() -> bool:
"""Read all per-config rows and emit CSV + PNG plots.
Called by ``conftest.pytest_sessionfinish`` (controller node only).
Returns True if any rows were aggregated, False otherwise.
"""
import csv
import json
row_files = sorted(_SWEEP_ROWS_DIR.glob("*.json")) \
if _SWEEP_ROWS_DIR.exists() else []
records: list[dict] = []
if row_files:
for p in row_files:
with open(p, encoding="utf-8") as f:
records.append(json.load(f))
else:
# Fallback: replot from existing summary.csv (skip sweep re-run).
summary_path = _SWEEP_OUT_DIR / "summary.csv"
if not summary_path.exists():
return False
with open(summary_path, encoding="utf-8") as f:
for row in csv.DictReader(f):
records.append({
"algorithm": row["algorithm"],
"sip_topology": row["sip_topology"],
"n_sips": int(row["n_sips"]),
"n_elem": int(row["n_elem"]),
"bytes_per_pe": int(row["bytes_per_pe"]),
"bytes_per_sip": int(row["bytes_per_sip"]),
"latency_ns": float(row["latency_ns"]),
})
if not records:
return False
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
def _fmt_bytes(x, _pos):
if x <= 0:
return "0"
if x >= 1024 * 1024:
return f"{x / (1024 * 1024):.0f} MB"
if x >= 1024:
return f"{x / 1024:.0f} KB"
return f"{x:.0f} B"
_bytes_fmt = FuncFormatter(_fmt_bytes)
_SWEEP_OUT_DIR.mkdir(parents=True, exist_ok=True)
with open(_SWEEP_OUT_DIR / "summary.csv", "w",
newline="", encoding="utf-8") as f:
w = csv.DictWriter(f, fieldnames=[
"algorithm", "sip_topology", "n_sips", "n_elem",
"bytes_per_pe", "bytes_per_sip", "latency_ns",
])
w.writeheader()
for r in sorted(records, key=lambda r: (
r["sip_topology"], r["bytes_per_pe"],
)):
w.writerow(r)
topologies = sorted({r["sip_topology"] for r in records})
for topo_name in topologies:
rs = sorted(
[r for r in records if r["sip_topology"] == topo_name],
key=lambda r: r["bytes_per_pe"],
)
if not rs:
continue
xs = [r["bytes_per_pe"] for r in rs]
ys = [r["latency_ns"] for r in rs]
title = (
f"Allreduce latency — {topo_name} "
f"(n_sips={rs[0]['n_sips']})"
)
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(xs, ys, marker="o", color="tab:blue")
ax.set_xscale("log", base=2)
ax.set_xlabel("Bytes per PE (log scale)")
ax.set_ylabel("max pe_exec_ns (critical path)")
ax.set_title(title)
ax.grid(True, alpha=0.3)
ax.xaxis.set_major_formatter(_bytes_fmt)
fig.tight_layout()
fig.savefig(_SWEEP_OUT_DIR / f"{topo_name}.png", dpi=120)
plt.close(fig)
colors = {"ring_1d": "tab:blue", "torus_2d": "tab:orange",
"mesh_2d_no_wrap": "tab:green"}
THEORETICAL_TORUS_2D_6SIP_NS = 10600.0
fig, ax = plt.subplots(figsize=(9, 6))
for topo_name in topologies:
rs = sorted(
[r for r in records if r["sip_topology"] == topo_name],
key=lambda r: r["bytes_per_pe"],
)
if not rs:
continue
ax.plot(
[r["bytes_per_pe"] for r in rs],
[r["latency_ns"] for r in rs],
marker="o",
label=f"{topo_name} (n_sips={rs[0]['n_sips']})",
color=colors.get(topo_name),
)
ax.axhline(
y=THEORETICAL_TORUS_2D_6SIP_NS,
color="tab:red", linestyle="--", linewidth=1.5,
label=f"theoretical torus_2d (6 SIPs) = "
f"{THEORETICAL_TORUS_2D_6SIP_NS:.0f} ns",
)
BYTES_96KB = 96 * 1024
ax.axvline(
x=BYTES_96KB, ymin=0, ymax=1,
color="tab:red", linestyle=":", linewidth=1.2,
)
ax.plot(
[BYTES_96KB], [THEORETICAL_TORUS_2D_6SIP_NS],
marker="x", color="tab:red", markersize=10, markeredgewidth=2,
)
# Find simulated torus_2d latency at 96 KB (if present) for direct
# comparison with the theoretical value.
sim_torus_at_96kb = next(
(r["latency_ns"] for r in records
if r["sip_topology"] == "torus_2d" and r["bytes_per_pe"] == BYTES_96KB),
None,
)
if sim_torus_at_96kb is not None:
ax.plot(
[BYTES_96KB], [sim_torus_at_96kb],
marker="o", color="tab:orange",
markersize=10, markeredgecolor="black", markeredgewidth=1.2,
)
ax.annotate(
f"96 KB\n"
f"theoretical = {THEORETICAL_TORUS_2D_6SIP_NS:.0f} ns\n"
f"simulated = {sim_torus_at_96kb:.0f} ns",
xy=(BYTES_96KB, sim_torus_at_96kb),
xytext=(10, -20), textcoords="offset points",
color="tab:red", fontsize=9,
)
else:
ax.annotate(
f"96 KB\n→ theoretical {THEORETICAL_TORUS_2D_6SIP_NS:.0f} ns",
xy=(BYTES_96KB, THEORETICAL_TORUS_2D_6SIP_NS),
xytext=(8, -20), textcoords="offset points",
color="tab:red", fontsize=9,
)
ax.set_xscale("log", base=2)
ax.set_xlabel("Bytes per PE (log scale)")
ax.set_ylabel("max pe_exec_ns (critical path)")
ax.set_title("Multi-device allreduce latency by topology")
ax.grid(True, alpha=0.3)
# Drop 128 KB tick (overlaps visually with the explicit 96 KB marker)
# and add 96 KB.
BYTES_128KB = 128 * 1024
existing_ticks = [t for t in ax.get_xticks() if int(t) != BYTES_128KB]
if BYTES_96KB not in existing_ticks:
existing_ticks.append(BYTES_96KB)
ax.set_xticks(sorted(existing_ticks))
ax.set_xlim(left=min(r["bytes_per_pe"] for r in records) / 2,
right=BYTES_96KB * 1.5)
ax.legend()
ax.xaxis.set_major_formatter(_bytes_fmt)
fig.tight_layout()
fig.savefig(_SWEEP_OUT_DIR / "overview.png", dpi=120)
plt.close(fig)
# Cleanup row staging dir so a partial future run doesn't pick up
# stale rows.
for p in row_files:
try:
p.unlink()
except OSError:
pass
try:
_SWEEP_ROWS_DIR.rmdir()
except OSError:
pass
print(f"\nWrote {_SWEEP_OUT_DIR / 'overview.png'} "
f"from {len(records)} rows")
return True
# ── Topology diagram (device-level + cube-level reduction) ────────────
# Convention: "rows × cols" everywhere, row-major rank assignment
# (rank = row * n_cols + col). For the 2×3 inter-SIP grid, this means
# 2 rows × 3 columns: SIP 0 1 2 / SIP 3 4 5.
_PALETTE_BG = "#fafbfd"
_PALETTE_FRAME = "#3a3f4a"
_PALETTE_BLUE = "#2c6fb6"
_PALETTE_GREEN = "#2e8a4e"
_PALETTE_TEXT = "#1f2530"
_PALETTE_BOX_FILL = "#eaf2fb"
_PALETTE_BOX_EDGE = "#2c4a78"
_PALETTE_ROOT_FILL = "#ffd9b8"
_PALETTE_ROOT_EDGE = "#bd5a14"
def _arrow(ax, xy_from, xy_to, color="black", lw=1.4, alpha=1.0,
style="-|>", curve=0.0):
from matplotlib.patches import FancyArrowPatch
arrow = FancyArrowPatch(
xy_from, xy_to,
arrowstyle=style, mutation_scale=12,
color=color, lw=lw, alpha=alpha,
connectionstyle=f"arc3,rad={curve}",
)
ax.add_patch(arrow)
def _draw_sip_box(ax, cx, cy, w, h, label, *, fill=_PALETTE_BOX_FILL,
edge=_PALETTE_BOX_EDGE, text_color=_PALETTE_TEXT,
font=10):
from matplotlib.patches import FancyBboxPatch
box = FancyBboxPatch(
(cx - w / 2, cy - h / 2), w, h,
boxstyle="round,pad=0.02,rounding_size=0.10",
linewidth=1.4, edgecolor=edge, facecolor=fill,
)
ax.add_patch(box)
ax.text(cx, cy, label, ha="center", va="center",
color=text_color, fontsize=font, fontweight="bold")
def _frame_panel(ax, title, lim_x=10.0, lim_y=6.0):
"""Set up a square-ish panel with a visible outer border."""
from matplotlib.patches import FancyBboxPatch
ax.set_xlim(0, lim_x)
ax.set_ylim(0, lim_y)
ax.set_aspect("equal")
ax.axis("off")
ax.set_facecolor(_PALETTE_BG)
border = FancyBboxPatch(
(0.05, 0.05), lim_x - 0.10, lim_y - 0.10,
boxstyle="round,pad=0.01,rounding_size=0.12",
linewidth=1.4, edgecolor=_PALETTE_FRAME, facecolor=_PALETTE_BG,
zorder=0,
)
ax.add_patch(border)
ax.set_title(title, fontsize=12, fontweight="bold",
color=_PALETTE_TEXT, pad=8)
def _draw_ring_topology(ax):
_frame_panel(ax, "ring_1d (6 SIPs)", lim_x=10.0, lim_y=6.0)
xs = [1.2, 2.7, 4.2, 5.7, 7.2, 8.7]
y = 3.1
box_w, box_h = 1.05, 0.9
for i, x in enumerate(xs):
_draw_sip_box(ax, x, y, box_w, box_h, f"SIP {i}")
# Forward ring (global_E) — adjacent neighbours, anchored to box edges.
for i in range(5):
_arrow(ax, (xs[i] + box_w / 2, y),
(xs[i + 1] - box_w / 2, y),
color=_PALETTE_BLUE, lw=1.6)
# Wrap (SIP 5 → SIP 0). Anchor at right-CENTER of SIP 5 and
# left-CENTER of SIP 0; arc OUTSIDE (above) the row so it does not
# overlap any of the SIP boxes in between.
_arrow(
ax,
(xs[5] + box_w / 2, y),
(xs[0] - box_w / 2, y),
color=_PALETTE_BLUE, lw=1.6, curve=-0.40,
)
ax.text(5.0, y + 2.0, "global_E (ring)", ha="center",
color=_PALETTE_BLUE, fontsize=10, style="italic")
ax.text(5.0, y - 1.5,
"(global_W = reverse direction, used by the algorithm)",
ha="center", color="gray", fontsize=8, style="italic")
def _draw_grid_topology(ax, kind, *, n_rows=2, n_cols=3):
"""kind ∈ {'torus', 'mesh'}. Lays out as n_rows × n_cols (row-major).
For the sweep we use 2 rows × 3 cols SIP layout::
row 0: SIP 0 SIP 1 SIP 2
row 1: SIP 3 SIP 4 SIP 5
"""
title = f"torus_2d ({n_rows}×{n_cols}, 6 SIPs)" if kind == "torus" \
else f"mesh_2d_no_wrap ({n_rows}×{n_cols}, 6 SIPs)"
_frame_panel(ax, title, lim_x=10.0, lim_y=6.0)
col_xs = [2.0, 5.0, 8.0] # 3 cols
row_ys = [4.3, 1.8] # 2 rows
box_w, box_h = 1.3, 0.95
pos: dict[tuple[int, int], tuple[float, float]] = {}
for r in range(n_rows):
for c in range(n_cols):
rank = r * n_cols + c
x, y = col_xs[c], row_ys[r]
pos[(r, c)] = (x, y)
_draw_sip_box(ax, x, y, box_w, box_h, f"SIP {rank}")
# Row edges (E↔W) — between adjacent columns within each row.
for r in range(n_rows):
for c in range(n_cols - 1):
x0, y0 = pos[(r, c)]
x1, y1 = pos[(r, c + 1)]
_arrow(ax, (x0 + box_w / 2, y0 + 0.10),
(x1 - box_w / 2, y1 + 0.10),
color=_PALETTE_BLUE, lw=1.5)
_arrow(ax, (x1 - box_w / 2, y1 - 0.10),
(x0 + box_w / 2, y0 - 0.10),
color=_PALETTE_BLUE, lw=1.5)
# Col edges (N↔S) — between adjacent rows within each column.
for c in range(n_cols):
for r in range(n_rows - 1):
x0, y0 = pos[(r, c)]
x1, y1 = pos[(r + 1, c)]
_arrow(ax, (x0 - 0.12, y0 - box_h / 2),
(x1 - 0.12, y1 + box_h / 2),
color=_PALETTE_GREEN, lw=1.5)
_arrow(ax, (x1 + 0.12, y1 + box_h / 2),
(x0 + 0.12, y0 - box_h / 2),
color=_PALETTE_GREEN, lw=1.5)
# Wrap arrows for torus only — anchor to the centre of the OUTER
# edge of the end SIPs and arc OUTSIDE the row/column so they do
# not overlap the SIPs in between.
if kind == "torus":
# Row wrap: last col → first col. Top row arcs UP, bottom row
# arcs DOWN, so each wrap sits clearly outside its own row.
for r in range(n_rows):
x0, y0 = pos[(r, 0)]
x1, y1 = pos[(r, n_cols - 1)]
curve = -0.45 if r == 0 else 0.45
_arrow(
ax,
(x1 + box_w / 2, y1),
(x0 - box_w / 2, y0),
color=_PALETTE_BLUE, lw=1.5,
curve=curve, alpha=0.9,
)
# Col wrap: last row → first row. Leftmost col arcs LEFT,
# rightmost col arcs RIGHT. Middle col(s) get a small inline
# marker + legend note (drawing them through the panel would
# collide with the row arrows).
for c in range(n_cols):
x0, y0 = pos[(0, c)]
x1, y1 = pos[(n_rows - 1, c)]
if c == 0:
curve = 0.55
elif c == n_cols - 1:
curve = -0.55
else:
continue # skip middle col — see legend note
_arrow(
ax,
(x1, y1 - box_h / 2),
(x0, y0 + box_h / 2),
color=_PALETTE_GREEN, lw=1.5,
curve=curve, alpha=0.9,
)
ax.text(0.7, 5.6, "global_E/W (row)", color=_PALETTE_BLUE,
fontsize=9, style="italic", fontweight="bold")
ax.text(0.7, 5.25, "global_N/S (col)", color=_PALETTE_GREEN,
fontsize=9, style="italic", fontweight="bold")
ax.text(0.7, 4.92,
"wrap = torus" if kind == "torus" else "no wrap = mesh",
color="gray", fontsize=8, style="italic")
if kind == "torus" and n_cols > 2:
ax.text(0.7, 0.3,
"(middle-col wrap omitted for clarity — every row "
"and every column wraps)",
color="gray", fontsize=7.5, style="italic")
def _draw_cube_reduction(ax):
"""4×4 cube grid inside SIP 0 — compact layout with phase legend."""
from matplotlib.patches import Rectangle
_frame_panel(ax, "Cube-level reduction inside SIP 0 (4×4 cubes)",
lim_x=10.0, lim_y=6.0)
cube_w = 0.65
cube_gap = 0.18
# Center the 4×4 grid in the left half of the panel.
grid_total = 4 * cube_w + 3 * cube_gap
grid_x0 = 0.7
grid_y0 = 0.7
centers: dict[tuple[int, int], tuple[float, float]] = {}
for r in range(4):
for c in range(4):
cx = grid_x0 + c * (cube_w + cube_gap) + cube_w / 2
cy = grid_y0 + (3 - r) * (cube_w + cube_gap) + cube_w / 2
centers[(r, c)] = (cx, cy)
cube_id = r * 4 + c
is_root = (r == 3 and c == 3)
face = _PALETTE_ROOT_FILL if is_root else _PALETTE_BOX_FILL
edge = _PALETTE_ROOT_EDGE if is_root else _PALETTE_BOX_EDGE
rect = Rectangle(
(cx - cube_w / 2, cy - cube_w / 2), cube_w, cube_w,
linewidth=1.2, edgecolor=edge, facecolor=face,
)
ax.add_patch(rect)
label = f"c{cube_id}"
ax.text(cx, cy, label, ha="center", va="center",
fontsize=7.5, fontweight="bold",
color=_PALETTE_ROOT_EDGE if is_root
else _PALETTE_TEXT)
# Phase 1: row reduce W→E.
for r in range(4):
for c in range(3):
x0, y0 = centers[(r, c)]
x1, y1 = centers[(r, c + 1)]
_arrow(ax, (x0 + cube_w / 2, y0), (x1 - cube_w / 2, y1),
color=_PALETTE_BLUE, lw=1.5)
# Phase 2: col reduce N→S along rightmost column.
for r in range(3):
x0, y0 = centers[(r, 3)]
x1, y1 = centers[(r + 1, 3)]
_arrow(ax, (x0, y0 - cube_w / 2), (x1, y1 + cube_w / 2),
color=_PALETTE_GREEN, lw=1.7)
# Phase legend on the right side.
legend_x = grid_x0 + grid_total + 0.55
ax.text(legend_x, 5.0, "Phase 1: row reduce (W → E)",
color=_PALETTE_BLUE, fontsize=10, fontweight="bold")
ax.text(legend_x, 4.55, "Phase 2: col reduce (N → S, rightmost col)",
color=_PALETTE_GREEN, fontsize=10, fontweight="bold")
ax.text(legend_x, 4.10, "Phase 3: inter-SIP exchange at root cube",
color=_PALETTE_ROOT_EDGE, fontsize=10, fontweight="bold")
ax.text(legend_x, 3.65, "Phase 4: col broadcast (S → N)",
color=_PALETTE_GREEN, fontsize=10, style="italic")
ax.text(legend_x, 3.20, "Phase 5: row broadcast (E → W)",
color=_PALETTE_BLUE, fontsize=10, style="italic")
ax.text(legend_x, 2.55,
"(broadcast phases reverse phases 2 & 1)",
color="gray", fontsize=8.5, style="italic")
ax.text(legend_x, 1.7,
"Root cube (c15, bottom-right) is the only\n"
"cube that performs the inter-SIP exchange.",
color=_PALETTE_ROOT_EDGE, fontsize=9, style="italic")
def emit_topology_diagram() -> str:
"""Emit a 2×2-panel topology diagram into allreduce_latency_plots/.
Top row: ring_1d | torus_2d (2×3)
Bot row: mesh_2d_no_wrap (2×3) | cube-level reduction in SIP 0
"""
import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
_SWEEP_OUT_DIR.mkdir(parents=True, exist_ok=True)
fig = plt.figure(figsize=(16, 10), facecolor="white")
gs = gridspec.GridSpec(2, 2, figure=fig, hspace=0.30, wspace=0.10)
ax_ring = fig.add_subplot(gs[0, 0])
ax_torus = fig.add_subplot(gs[0, 1])
ax_mesh = fig.add_subplot(gs[1, 0])
ax_cube = fig.add_subplot(gs[1, 1])
_draw_ring_topology(ax_ring)
_draw_grid_topology(ax_torus, "torus", n_rows=2, n_cols=3)
_draw_grid_topology(ax_mesh, "mesh", n_rows=2, n_cols=3)
_draw_cube_reduction(ax_cube)
fig.suptitle(
"Allreduce topology — device-level (top: ring, torus, mesh) "
"and cube-level reduction in SIP 0",
fontsize=14, fontweight="bold", color=_PALETTE_TEXT, y=0.98,
)
out_path = _SWEEP_OUT_DIR / "topology.png"
fig.savefig(out_path, dpi=130, bbox_inches="tight",
facecolor=fig.get_facecolor())
plt.close(fig)
return str(out_path)
def test_emit_topology_diagram():
"""Emit topology.png alongside the sweep plots. Pure plotting; no sim."""
out = emit_topology_diagram()
assert Path(out).exists()
+1 -1
View File
@@ -23,7 +23,7 @@ def _engine():
def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
slice_bytes = 48 * (1 << 30) // 8
pa = PhysAddr.pe_hbm_addr(
rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
sip_id=sip, die_id=cube, pe_id=pe_id,
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
)
return pa.encode()
-48
View File
@@ -1,48 +0,0 @@
"""Test that tl.recv() (no direction) works under the mock runtime
and the SimPy PE_IPCQ component (ADR-0023 D4 weak fairness)."""
from __future__ import annotations
import numpy as np
from kernbench.ccl.testing import run_kernel_in_mock
def kernel_round_robin(t_ptr, n_elem, tl):
"""Each PE sends one tile E then receives N-1 tiles via round-robin.
Uses TensorHandle math (PE_MATH) so Phase 2 produces correct HBM
contents under SimPy + op_log replay."""
rank = tl.program_id(axis=0)
world_size = tl.num_programs(axis=0)
nbytes = n_elem * 2
pe_addr = t_ptr + rank * nbytes
acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
current = acc
for _step in range(world_size - 1):
tl.send(dir="E", src=current)
# No direction → round-robin
recv = tl.recv(shape=(n_elem,), dtype="f16")
acc = acc + recv
current = recv # forward W's tile to E next round
tl.store(pe_addr, acc)
def test_round_robin_recv_mock_runtime():
n_elem = 8
inputs = [
np.full((n_elem,), float(r + 1), dtype=np.float16)
for r in range(4)
]
expected = sum(inputs) # [10,...]
outputs = run_kernel_in_mock(
kernel_fn=kernel_round_robin,
world_size=4,
topology="ring_1d",
inputs=inputs,
kernel_args=(n_elem,),
)
for r in range(4):
assert np.allclose(outputs[r], expected)
+1 -1
View File
@@ -30,7 +30,7 @@ def _graph():
def _hbm_pa(pe_id: int = 0) -> int:
slice_bytes = 48 * (1 << 30) // 8
pa = PhysAddr.pe_hbm_addr(
rack_id=0, sip_id=0, cube_id=0, pe_id=pe_id,
sip_id=0, die_id=0, pe_id=pe_id,
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
)
return pa.encode()
+194
View File
@@ -0,0 +1,194 @@
"""ADR-0009 D5 invariant: all PEs targeted by a single kernel launch MUST
begin executing the kernel body at the same simulated time, regardless of
their dispatch path length.
These tests directly verify the invariant by capturing per-PE state at the
top of `_execute_kernel`:
test_no_pe_arrives_after_target_start_ns
Asserts: for every PE that enters _execute_kernel during a multi-cube
launch, `env.now` at entry must be <= target_start_ns. Otherwise the
PE's barrier yield would be a no-op and `pe_exec_start` would be set
late, breaking the D5 "same simulated time" mandate.
test_all_pes_have_identical_pe_exec_start
Asserts: every PE's `pe_exec_start` (the value of `env.now` recorded
immediately AFTER the barrier yield) is identical across all PEs in
the launch.
Both tests are expected to FAIL today and become the regression check the
Phase 2 D5 predictor + fallback fix must make pass.
"""
from __future__ import annotations
from pathlib import Path
import numpy as np
import pytest
from kernbench.policy.placement.dp import DPPolicy
from kernbench.runtime_api.context import RuntimeContext
from kernbench.runtime_api.types import DeviceSelector
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import resolve_topology
TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
def _capture_per_pe_d5_state():
"""Monkey-patch PeCpuComponent._execute_kernel to record, per PE:
- entry_now: env.now at function entry (before any yield)
- target_start_ns: the value carried by the request
- barrier_yielded: True if the barrier yield fired (entry_now < target)
- pe_exec_start: env.now immediately after the barrier check
(i.e. the value the original code sets)
Returns (records: list[dict], restore: callable).
"""
import kernbench.components.builtin.pe_cpu as pe_cpu_mod
records: list[dict] = []
original = pe_cpu_mod.PeCpuComponent._execute_kernel
def patched(self, env, txn):
request = txn.request
target_start = getattr(request, "target_start_ns", None)
entry_now = float(env.now)
rec = {
"node_id": self.node.id,
"entry_now": entry_now,
"target_start_ns": (
float(target_start) if target_start is not None else None
),
"barrier_yielded": (
target_start is not None
and float(target_start) > entry_now
),
"pe_exec_start": None, # filled below by sniff
"late_ns": (
None if target_start is None
else max(0.0, entry_now - float(target_start))
),
}
records.append(rec)
# We can't easily inject a callback at the original's
# `pe_exec_start = env.now` line without rewriting it. Approximate:
# if the original yields the barrier, env.now after the yield is
# target_start_ns; otherwise pe_exec_start is entry_now (skipped).
if rec["barrier_yielded"]:
rec["pe_exec_start"] = float(target_start)
else:
rec["pe_exec_start"] = entry_now
yield from original(self, env, txn)
pe_cpu_mod.PeCpuComponent._execute_kernel = patched
def restore():
pe_cpu_mod.PeCpuComponent._execute_kernel = original
return records, restore
def _run_multicube_launch():
"""Drive a no-op kernel launch across all 16 cubes x 8 PEs and return
the per-PE D5 records collected by the monkey-patch."""
records, restore = _capture_per_pe_d5_state()
try:
topo = resolve_topology(str(TOPOLOGY_PATH))
engine = GraphEngine(topo.topology_obj, enable_data=True)
spec = topo.topology_obj.spec
with RuntimeContext(
engine=engine, target_device=DeviceSelector("all"),
correlation_id="d5_barrier", spec=spec,
) as ctx:
dp = DPPolicy(
cube="row_wise", pe="column_wise",
num_cubes=16, num_pes=8,
)
def kernel(t_ptr, n_elem, tl):
pass # no-op
ctx.ahbm.set_device(0)
t = ctx.zeros(
(16, 8 * 64), dtype="f16", dp=dp, name="probe",
)
t.copy_(ctx.from_numpy(
np.zeros((16, 8 * 64), dtype=np.float16),
))
pending = ctx.launch(
"d5_probe", kernel, t, 64, _defer_wait=True,
)
for h, _sip, meta in pending:
ctx.wait(h, _meta=meta)
finally:
restore()
return records
def test_no_pe_arrives_after_target_start_ns():
"""ADR-0009 D5: no PE may enter `_execute_kernel` after target_start_ns.
Today this fails because IO_CPU's predictor under-shoots actual
dispatch latency for far cubes (cube4, cube9-15). Phase 2 fix:
chain-aware predictor in IO_CPU + monotonic upward re-stamp in M_CPU.
"""
records = _run_multicube_launch()
assert records, "expected per-PE _execute_kernel records"
late = [
r for r in records
if r["target_start_ns"] is not None
and r["late_ns"] is not None
and r["late_ns"] > 1e-6
]
if late:
# Provide actionable diagnostic in the failure.
worst = sorted(late, key=lambda r: -r["late_ns"])[:5]
details = "\n".join(
f" {r['node_id']}: late by {r['late_ns']:.2f} ns "
f"(entry_now={r['entry_now']:.2f}, "
f"target_start_ns={r['target_start_ns']:.2f})"
for r in worst
)
pytest.fail(
f"ADR-0009 D5 violated: {len(late)}/{len(records)} PEs "
f"entered _execute_kernel AFTER target_start_ns "
f"(barrier yield silently skipped). "
f"Worst offenders:\n{details}"
)
def test_all_pes_have_identical_pe_exec_start():
"""ADR-0009 D5: every PE's pe_exec_start must be identical.
With D5 honored, every PE either yields to target_start_ns (start =
target_start_ns) or, if late, would still be aligned by the M_CPU
upward re-stamp (Phase 2). Today: 75/128 PEs in this launch have
distinct pe_exec_start values because they skipped the barrier.
"""
records = _run_multicube_launch()
assert records, "expected per-PE _execute_kernel records"
starts = sorted({round(r["pe_exec_start"], 6) for r in records})
if len(starts) > 1:
spread = max(starts) - min(starts)
# Distribution of how many PEs at each distinct start time
from collections import Counter
bucket = Counter(round(r["pe_exec_start"], 6) for r in records)
details = "\n".join(
f" pe_exec_start={t}: {n} PEs"
for t, n in sorted(bucket.items())
)
pytest.fail(
f"ADR-0009 D5 violated: PEs have {len(starts)} distinct "
f"pe_exec_start values (spread = {spread:.2f} ns); "
f"D5 mandates a single common value. "
f"Distribution:\n{details}"
)
+1 -1
View File
@@ -50,7 +50,7 @@ def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
from kernbench.policy.address.phyaddr import PhysAddr
slice_bytes = 48 * (1 << 30) // 8
pa = PhysAddr.pe_hbm_addr(
rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
sip_id=sip, die_id=cube, pe_id=pe_id,
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
)
return pa.encode()
+1 -1
View File
@@ -31,7 +31,7 @@ def _hbm_pa(sip=0, cube=0, pe_id=0):
from kernbench.policy.address.phyaddr import PhysAddr
slice_bytes = 48 * (1 << 30) // 8
pa = PhysAddr.pe_hbm_addr(
rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
sip_id=sip, die_id=cube, pe_id=pe_id,
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
)
return pa.encode()
+2 -2
View File
@@ -29,7 +29,7 @@ def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
# 48 GB / 8 slices = 6 GB per slice
slice_bytes = 48 * (1 << 30) // 8
pa = PhysAddr.pe_hbm_addr(
rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
sip_id=sip, die_id=cube, pe_id=pe_id,
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
)
return pa.encode()
@@ -37,7 +37,7 @@ def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
def _sram_pa(sip: int = 0, cube: int = 0) -> int:
"""Create an SRAM physical address."""
pa = PhysAddr.cube_sram_addr(rack_id=0, sip_id=sip, cube_id=cube, sram_offset=0x800)
pa = PhysAddr.cube_sram_addr(sip_id=sip, die_id=cube, sram_offset=0x800)
return pa.encode()
+86 -60
View File
@@ -1,8 +1,9 @@
"""Tests for configure_sfr_intercube_multisip neighbor table wiring.
Verifies that IPCQ neighbor tables are correctly installed for
intercube (pe0, 4×4 mesh N/S/E/W) + inter-SIP (pe0, all cubes,
global_E/global_W) communication.
Verifies full IPCQ hardware wiring (independent of DPPolicy):
- intra-cube (2×4 PE grid) intra_N/S/E/W
- intercube same-lane N/S/E/W
- inter-SIP same-(cube, pe) global_N/S/E/W
"""
from __future__ import annotations
@@ -16,6 +17,7 @@ from kernbench.topology.builder import resolve_topology
TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
N_CUBES = 16
PES_PER_CUBE = 8
def _engine_and_spec():
@@ -36,78 +38,102 @@ class TestConfigureSfrNeighborTables:
plan = configure_sfr_intercube_multisip(engine, spec, cfg)
n_sips = int(spec["system"]["sips"]["count"])
assert plan["world_size"] == n_sips * N_CUBES
assert len(plan["rank_to_pe"]) == n_sips * N_CUBES
for pe_idx, (sip, cube, pe) in enumerate(plan["rank_to_pe"]):
assert pe == 0, f"pe_idx {pe_idx}: pe must be 0, got {pe}"
expected = n_sips * N_CUBES * PES_PER_CUBE
assert plan["world_size"] == expected
assert len(plan["rank_to_pe"]) == expected
def test_corner_cube0_has_E_and_S_only(self):
"""Cube 0 (row=0, col=0) is NW corner: only E and S neighbors."""
# ── Intra-cube (intra_N/S/E/W) ────────────────────────────────
def test_pe0_intra_cube_has_intra_E_and_intra_S(self):
"""pe0 is NW of the 2×4 PE grid: intra_E=pe1, intra_S=pe4."""
engine, spec = _engine_and_spec()
cfg = _merged_cfg()
configure_sfr_intercube_multisip(engine, spec, cfg)
ipcq = engine._components["sip0.cube0.pe0.pe_ipcq"]
qp = ipcq.queue_pairs
assert "E" in qp, "cube 0 must have E neighbor"
assert "S" in qp, "cube 0 must have S neighbor"
assert "W" not in qp, "cube 0 (col=0) must NOT have W neighbor"
assert "N" not in qp, "cube 0 (row=0) must NOT have N neighbor"
qp = engine._components["sip0.cube0.pe0.pe_ipcq"].queue_pairs
assert "intra_E" in qp
assert qp["intra_E"]["peer"].pe == 1
assert "intra_S" in qp
assert qp["intra_S"]["peer"].pe == 4
assert "intra_W" not in qp
assert "intra_N" not in qp
def test_pe5_intra_cube_has_all_four(self):
"""pe5 (row=1, col=1 in 2×4 grid) has all 4 intra directions.
Intra neighbors: intra_N=pe1, intra_E=pe6, intra_W=pe4,
intra_S not present (row=1 is bottom row).
"""
engine, spec = _engine_and_spec()
cfg = _merged_cfg()
configure_sfr_intercube_multisip(engine, spec, cfg)
qp = engine._components["sip0.cube0.pe5.pe_ipcq"].queue_pairs
assert qp["intra_N"]["peer"].pe == 1
assert qp["intra_E"]["peer"].pe == 6
assert qp["intra_W"]["peer"].pe == 4
assert "intra_S" not in qp # bottom row
# ── Intercube same-lane (N/S/E/W) ─────────────────────────────
def test_corner_cube0_pe0_has_intercube_E_and_S(self):
"""Cube 0 (NW mesh corner): intercube E→cube1, S→cube4."""
engine, spec = _engine_and_spec()
cfg = _merged_cfg()
configure_sfr_intercube_multisip(engine, spec, cfg)
qp = engine._components["sip0.cube0.pe0.pe_ipcq"].queue_pairs
assert qp["E"]["peer"].cube == 1
assert qp["E"]["peer"].pe == 0 # same-lane
assert qp["S"]["peer"].cube == 4
assert qp["S"]["peer"].pe == 0
assert "W" not in qp, "cube 0 has no west neighbor"
assert "N" not in qp, "cube 0 has no north neighbor"
def test_interior_cube5_has_all_four(self):
"""Cube 5 (row=1, col=1) is interior: N/S/E/W all present."""
def test_interior_cube5_pe3_has_all_four_intercube_same_lane(self):
"""Cube 5 interior, pe3: intercube N/S/E/W all present, same-lane."""
engine, spec = _engine_and_spec()
cfg = _merged_cfg()
configure_sfr_intercube_multisip(engine, spec, cfg)
ipcq = engine._components["sip0.cube5.pe0.pe_ipcq"]
qp = ipcq.queue_pairs
assert qp["N"]["peer"].cube == 1
assert qp["S"]["peer"].cube == 9
assert qp["E"]["peer"].cube == 6
assert qp["W"]["peer"].cube == 4
qp = engine._components["sip0.cube5.pe3.pe_ipcq"].queue_pairs
for d, expected_cube in [("N", 1), ("S", 9), ("E", 6), ("W", 4)]:
assert qp[d]["peer"].cube == expected_cube
assert qp[d]["peer"].pe == 3 # same-lane
def test_root_cube15_has_inter_sip(self):
"""Cube 15 (root, SE corner) has N, W + global_E/global_W."""
def test_all_pes_have_intercube_wiring(self):
"""Every PE on every interior cube has intercube same-lane wiring."""
engine, spec = _engine_and_spec()
cfg = _merged_cfg()
configure_sfr_intercube_multisip(engine, spec, cfg)
ipcq0 = engine._components["sip0.cube15.pe0.pe_ipcq"]
qp0 = ipcq0.queue_pairs
assert "N" in qp0
assert "W" in qp0
assert "E" not in qp0, "cube 15 (col=3) must NOT have E"
assert "S" not in qp0, "cube 15 (row=3) must NOT have S"
assert "global_E" in qp0, "root cube must have global_E"
assert "global_W" in qp0, "root cube must have global_W"
assert qp0["global_E"]["peer"].sip == 1
assert qp0["global_E"]["peer"].cube == 15
ipcq1 = engine._components["sip1.cube15.pe0.pe_ipcq"]
qp1 = ipcq1.queue_pairs
assert qp1["global_E"]["peer"].sip == 0
assert qp1["global_E"]["peer"].cube == 15
def test_all_cubes_have_inter_sip(self):
"""ALL cubes (not just root) are wired for inter-SIP."""
engine, spec = _engine_and_spec()
cfg = _merged_cfg()
configure_sfr_intercube_multisip(engine, spec, cfg)
root_cube = int(cfg.get("root_cube", N_CUBES - 1))
for cube_id in range(N_CUBES):
ipcq = engine._components[f"sip0.cube{cube_id}.pe0.pe_ipcq"]
qp = ipcq.queue_pairs
assert "global_E" in qp, (
f"sip0.cube{cube_id}.pe0 missing global_E"
)
assert "global_W" in qp, (
f"sip0.cube{cube_id}.pe0 missing global_W"
)
if cube_id == root_cube:
assert qp["global_E"]["peer"].sip != 0, (
f"root cube {root_cube} global_E must point to another SIP"
# Interior cube 5: every PE should have N/S/E/W same-lane.
for pe in range(PES_PER_CUBE):
qp = engine._components[f"sip0.cube5.pe{pe}.pe_ipcq"].queue_pairs
for d in ("N", "S", "E", "W"):
assert d in qp, f"sip0.cube5.pe{pe} missing intercube {d}"
assert qp[d]["peer"].pe == pe, (
f"sip0.cube5.pe{pe} {d} not same-lane"
)
# ── Inter-SIP (global_*) ──────────────────────────────────────
def test_every_pe_on_every_cube_has_inter_sip(self):
"""All PEs on all cubes wired for inter-SIP via global_*."""
engine, spec = _engine_and_spec()
cfg = _merged_cfg()
configure_sfr_intercube_multisip(engine, spec, cfg)
for cube_id in range(N_CUBES):
for pe in range(PES_PER_CUBE):
qp = engine._components[
f"sip0.cube{cube_id}.pe{pe}.pe_ipcq"
].queue_pairs
assert "global_E" in qp, (
f"sip0.cube{cube_id}.pe{pe} missing global_E"
)
assert "global_W" in qp
# Peer must be same (cube, pe) on another SIP.
assert qp["global_E"]["peer"].sip == 1
assert qp["global_E"]["peer"].cube == cube_id
assert qp["global_E"]["peer"].pe == pe
+1 -1
View File
@@ -36,7 +36,7 @@ def _engine():
def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
slice_bytes = 48 * (1 << 30) // 8
pa = PhysAddr.pe_hbm_addr(
rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
sip_id=sip, die_id=cube, pe_id=pe_id,
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
)
return pa.encode()
+62
View File
@@ -0,0 +1,62 @@
"""ADR-0009 D5: synchronized launch barrier.
M_CPU stamps KernelLaunchMsg with target_start_ns = env.now + max path
latency; PE_CPU yields until that time before recording pe_exec_start.
Every PE in a single launch MUST begin kernel execution at the same
env.now regardless of its dispatch path length.
We verify this indirectly: for a no-op kernel, pe_exec_ns = env.now -
pe_exec_start. If every PE's pe_exec_start is identical and every PE
runs the same no-op body, every pe_exec_ns value must be identical.
Without D5, pe_exec_start varies by dispatch-path length and so does
pe_exec_ns.
"""
from __future__ import annotations
from pathlib import Path
import numpy as np
from kernbench.policy.placement.dp import DPPolicy
from kernbench.runtime_api.context import RuntimeContext
from kernbench.runtime_api.types import DeviceSelector
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import resolve_topology
TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
def test_kernel_launch_sync_all_pes_have_equal_exec_time():
"""No-op kernel: every PE's pe_exec_ns must be identical under D5."""
topo = resolve_topology(str(TOPOLOGY_PATH))
engine = GraphEngine(topo.topology_obj, enable_data=True)
spec = topo.topology_obj.spec
with RuntimeContext(engine=engine, target_device=DeviceSelector("all"),
correlation_id="sync_test", spec=spec) as ctx:
dp = DPPolicy(cube="row_wise", pe="column_wise",
num_cubes=16, num_pes=8)
def kernel(t_ptr, n_elem, tl):
pass # no-op
ctx.ahbm.set_device(0)
t = ctx.zeros((16, 8 * 64), dtype="f16", dp=dp, name="probe")
t.copy_(ctx.from_numpy(np.zeros((16, 8 * 64), dtype=np.float16)))
pending = ctx.launch("sync_probe", kernel, t, 64, _defer_wait=True)
for h, _sip, meta in pending:
ctx.wait(h, _meta=meta)
pe_exec_vals = []
for h, _sip, _meta in pending:
_, trace = engine.get_completion(h)
if trace and trace.get("pe_exec_ns") is not None:
pe_exec_vals.append(float(trace["pe_exec_ns"]))
assert pe_exec_vals, "expected completion traces with pe_exec_ns"
spread = max(pe_exec_vals) - min(pe_exec_vals)
assert spread < 1e-6, (
f"ADR-0009 D5 violated: pe_exec_ns spread across PEs = "
f"{spread:.6f} ns (expected 0). Values: {pe_exec_vals}"
)
+1 -1
View File
@@ -38,7 +38,7 @@ def _engine():
def _hbm_pa(sip=0, cube=0, pe_id=0):
slice_bytes = 48 * (1 << 30) // 8
pa = PhysAddr.pe_hbm_addr(
rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
sip_id=sip, die_id=cube, pe_id=pe_id,
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
)
return pa.encode()
+1 -1
View File
@@ -53,7 +53,7 @@ def _engine():
def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
slice_bytes = 48 * (1 << 30) // 8
pa = PhysAddr.pe_hbm_addr(
rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
sip_id=sip, die_id=cube, pe_id=pe_id,
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
)
return pa.encode()
+741
View File
@@ -0,0 +1,741 @@
"""Diagnostic for the inter-cube RAW > IPCQ asymmetry on h3/h4 plots.
Single-shot run at h3 (sip0.cube0.pe0 -> sip0.cube1.pe0), nbytes=4096.
Captures per-PE pe_exec_ns and the actual path / drain / per-node overhead
breakdown for the RAW sub-txn (PE_DMA -> remote HBM_CTRL) vs the IPCQ
outbound sub-txn (PE_DMA -> peer PE_DMA), so we can localize the gap to
one of:
(a) drain at HBM-BW (RAW) vs fabric-BW (IPCQ)
(b) path-length / per-node overhead asymmetry
(c) RAW SRC paying tl.load (local HBM read) on top of remote tl.store
while IPCQ DST only pays inbound traversal+drain.
Phase 1 / test-only. No production code is modified.
"""
from __future__ import annotations
from pathlib import Path
import numpy as np
import pytest
from kernbench.ccl.install import load_ccl_config, resolve_algorithm_config
from kernbench.ccl.sfr_config import configure_sfr_intercube_multisip
from kernbench.policy.placement.dp import DPPolicy
from kernbench.runtime_api.context import RuntimeContext
from kernbench.runtime_api.types import DeviceSelector
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import resolve_topology
TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
import os
# Allow the test to be re-run for h4 (inter-cube vertical) at multiple sizes
# to investigate why IPCQ slope flattens past 8192 B (path may differ).
NBYTES = int(os.environ.get("DIAG_NBYTES", "4096"))
ELEM_BYTES = 2
N_ELEM = NBYTES // ELEM_BYTES
N_CUBES = 16
N_PES = 8
HOP = os.environ.get("DIAG_HOP", "h3")
if HOP == "h4":
SRC = (0, 0, 0)
DST = (0, 4, 0) # h4 inter-cube vertical
else:
SRC = (0, 0, 0)
DST = (0, 1, 0) # h3 inter-cube horizontal
# ── Per-PE pe_exec_ns capture via monkey-patch ───────────────────────
def _install_barrier_capture():
"""Wrap PeCpuComponent._execute_kernel to log, for every PE that
enters: env.now at entry, target_start_ns the request carried,
whether the barrier yield fired (i.e. env.now < target_start_ns),
and env.now at pe_exec_start.
"""
import kernbench.components.builtin.pe_cpu as pe_cpu_mod
log: list[dict] = []
original = pe_cpu_mod.PeCpuComponent._execute_kernel
def patched(self, env, txn):
request = txn.request
target_start = getattr(request, "target_start_ns", None)
entry_now = float(env.now)
log_entry = {
"node_id": self.node.id,
"entry_now": entry_now,
"target_start_ns": (
float(target_start) if target_start is not None else None
),
"barrier_skipped": (
target_start is None
or float(target_start) <= entry_now
),
"delta_late_ns": (
None if target_start is None
else max(0.0, entry_now - float(target_start))
),
}
log.append(log_entry)
yield from original(self, env, txn)
pe_cpu_mod.PeCpuComponent._execute_kernel = patched
def restore():
pe_cpu_mod.PeCpuComponent._execute_kernel = original
return log, restore
def _install_per_pe_capture():
"""Wrap PeCpuComponent._execute_kernel so we record (node_id ->
pe_exec_ns) for every PE that executes a kernel during the run.
Returns (capture_dict, restore_callable).
"""
import kernbench.components.builtin.pe_cpu as pe_cpu_mod
captured: dict[str, float] = {}
original = pe_cpu_mod.PeCpuComponent._execute_kernel
def patched(self, env, txn):
gen = original(self, env, txn)
try:
value = yield from gen
finally:
v = txn.result_data.get("pe_exec_ns")
if v is not None:
captured[self.node.id] = float(v)
return value
pe_cpu_mod.PeCpuComponent._execute_kernel = patched
def restore():
pe_cpu_mod.PeCpuComponent._execute_kernel = original
return captured, restore
def _install_recv_capture(target_node_id: str):
"""Wrap PeIpcqComponent._handle_recv to log entry/exit times and the
peer_head_cache/my_tail values seen at the start.
This pins down whether recv ever blocked on a wait_event, or whether
it consumed without waiting (i.e. peer_head_cache > my_tail at entry).
"""
import kernbench.components.builtin.pe_ipcq as pe_ipcq_mod
log: list[dict] = []
original = pe_ipcq_mod.PeIpcqComponent._handle_recv
def patched(self, env, req, cmd):
if self.node.id != target_node_id:
yield from original(self, env, req, cmd)
return
# Snapshot state before dispatch
d = cmd.direction
qp = self._queue_pairs.get(d, {})
log.append({
"phase": "enter",
"t": float(env.now),
"direction": d,
"peer_head_cache": qp.get("peer_head_cache"),
"my_tail": qp.get("my_tail"),
})
yield from original(self, env, req, cmd)
qp = self._queue_pairs.get(d, {})
log.append({
"phase": "exit",
"t": float(env.now),
"direction": d,
"peer_head_cache": qp.get("peer_head_cache"),
"my_tail": qp.get("my_tail"),
})
pe_ipcq_mod.PeIpcqComponent._handle_recv = patched
def restore():
pe_ipcq_mod.PeIpcqComponent._handle_recv = original
return log, restore
def _install_meta_arrival_capture(target_node_id: str):
"""Log every IpcqMetaArrival that lands on ``target_node_id`` PE_IPCQ.
Records (env_now, sender_seq, dst_addr, matched_direction,
peer_head_cache_before, my_tail_before).
"""
import kernbench.components.builtin.pe_ipcq as pe_ipcq_mod
log: list[dict] = []
original = pe_ipcq_mod.PeIpcqComponent._handle_meta_arrival
def patched(self, msg):
if self.node.id == target_node_id:
token = msg.token
now = float(self._env.now) if hasattr(self, "_env") else 0.0
# _env is not stored on the component; use ctx? Fall back to
# introspection via self._inbox._env (SimPy stores reference).
try:
now = float(self._inbox._env.now)
except Exception:
pass
entry = {
"t": now,
"sender_seq": getattr(token, "sender_seq", None),
"dst_addr": getattr(token, "dst_addr", None),
"src_sip": getattr(token, "src_sip", None),
"src_cube": getattr(token, "src_cube", None),
"src_pe": getattr(token, "src_pe", None),
"src_direction": getattr(token, "src_direction", None),
"nbytes": getattr(token, "nbytes", None),
"matched_direction": None,
"peer_head_cache_before": {},
"my_tail_before": {},
}
for d, qp in self._queue_pairs.items():
entry["peer_head_cache_before"][d] = qp["peer_head_cache"]
entry["my_tail_before"][d] = qp["my_tail"]
base = qp["my_rx_base_pa"]
size = qp["n_slots"] * qp["slot_size"]
if base <= entry["dst_addr"] < base + size:
entry["matched_direction"] = d
log.append(entry)
return original(self, msg)
pe_ipcq_mod.PeIpcqComponent._handle_meta_arrival = patched
def restore():
pe_ipcq_mod.PeIpcqComponent._handle_meta_arrival = original
return log, restore
def _snapshot_qp_state(engine, target_node_id: str) -> dict:
"""Snapshot every direction's qp state on the target PE_IPCQ now.
Captures peer_head_cache, my_tail, my_rx_base_pa, n_slots, slot_size
for each installed direction.
"""
comp = engine._components.get(target_node_id)
if comp is None:
return {}
return {
d: {
"peer_head_cache": qp["peer_head_cache"],
"my_tail": qp["my_tail"],
"my_rx_base_pa": qp["my_rx_base_pa"],
"n_slots": qp["n_slots"],
"slot_size": qp["slot_size"],
"rx_range": (
qp["my_rx_base_pa"],
qp["my_rx_base_pa"] + qp["n_slots"] * qp["slot_size"],
),
}
for d, qp in comp.queue_pairs.items()
}
# ── Path / drain breakdown using engine ctx ──────────────────────────
def _path_breakdown(ctx, path: list[str], nbytes: int) -> dict:
edge_total_ns = 0.0
edge_details = []
min_bw = float("inf")
for i in range(len(path) - 1):
edge = ctx.edge_map.get((path[i], path[i + 1]))
if edge is None:
edge_details.append((path[i], path[i + 1], None, None, None))
continue
prop_ns = edge.distance_mm * ctx.ns_per_mm
edge_total_ns += prop_ns
bw = getattr(edge, "bw_gbs", None) or 0.0
if bw > 0 and bw < min_bw:
min_bw = bw
edge_details.append(
(path[i], path[i + 1], edge.distance_mm, prop_ns, bw),
)
overhead_total_ns = 0.0
overhead_details = []
for nid in path:
oh = float(ctx.node_overhead_ns.get(nid, 0.0))
overhead_total_ns += oh
overhead_details.append((nid, oh))
drain_ns = ctx.compute_drain_ns(path, nbytes)
bottleneck_bw = None if min_bw == float("inf") else min_bw
return {
"path": path,
"edges": edge_details,
"edge_total_ns": edge_total_ns,
"overheads": overhead_details,
"overhead_total_ns": overhead_total_ns,
"drain_ns": drain_ns,
"bottleneck_bw_gbs": bottleneck_bw,
"expected_total_ns": edge_total_ns + overhead_total_ns + drain_ns,
}
def _print_breakdown(label: str, br: dict) -> None:
print(f"\n {label}")
print(f" path ({len(br['path'])} nodes):")
for nid in br["path"]:
print(f" - {nid}")
print(f" edges (prop. delay):")
for src, dst, dist_mm, prop_ns, bw in br["edges"]:
if dist_mm is None:
print(f" ! {src} -> {dst} EDGE NOT FOUND IN edge_map")
continue
print(
f" {src} -> {dst} "
f"dist={dist_mm:.3f}mm prop={prop_ns:.2f}ns "
f"bw={bw or 0:.2f}GB/s"
)
print(f" per-node overhead_ns:")
for nid, oh in br["overheads"]:
if oh > 0:
print(f" {nid:<60s} overhead_ns={oh:.2f}")
print(f" edge_total_ns = {br['edge_total_ns']:.2f}")
print(f" overhead_total_ns = {br['overhead_total_ns']:.2f}")
print(f" bottleneck_bw_gbs = {br['bottleneck_bw_gbs']}")
print(f" drain_ns (nbytes={NBYTES}) = {br['drain_ns']:.2f}")
print(f" expected_total_ns = {br['expected_total_ns']:.2f}")
# ── RAW path scenario ────────────────────────────────────────────────
def _dump_src_op_records(engine, src_sip, src_cube, src_pe, label) -> None:
"""Print op_logger records for ops on the SRC PE.
The op log captures t_start/t_end for memory/math/gemm/copy ops on
every component, so we can see how long tl.load vs tl.store vs
tl.send actually took at the engine level.
"""
op_logger = getattr(engine, "_op_logger", None)
if op_logger is None:
print(f" ({label}) op_logger not available")
return
src_prefix = f"sip{src_sip}.cube{src_cube}.pe{src_pe}."
recs = [r for r in op_logger.records if r.component_id.startswith(src_prefix)]
print(f" ({label}) op_logger records on SRC PE ({src_prefix}*):")
for r in recs[:40]:
dur = r.t_end - r.t_start
comp_short = r.component_id.replace(src_prefix, "")
params_short = ""
if "nbytes" in r.params:
params_short = f" nbytes={r.params['nbytes']}"
if "src_addr" in r.params:
params_short += f" src_addr={r.params['src_addr']}"
if "dst_addr" in r.params:
params_short += f" dst_addr={r.params['dst_addr']}"
print(
f" t=[{r.t_start:7.2f}..{r.t_end:7.2f}] dur={dur:6.2f}ns "
f"{comp_short:<25s} {r.op_kind:<8s} {r.op_name:<12s}{params_short}"
)
def _run_raw():
captured, restore = _install_per_pe_capture()
try:
topo = resolve_topology(str(TOPOLOGY_PATH))
engine = GraphEngine(topo.topology_obj, enable_data=True)
spec = topo.topology_obj.spec
src_sip, src_cube, src_pe = SRC
dst_sip, dst_cube, dst_pe = DST
assert src_sip == dst_sip
src_off = (src_cube * N_PES + src_pe) * N_ELEM * ELEM_BYTES
dst_off = (dst_cube * N_PES + dst_pe) * N_ELEM * ELEM_BYTES
with RuntimeContext(
engine=engine,
target_device=DeviceSelector("all"),
correlation_id="diag_raw",
spec=spec,
) as rt:
dp = DPPolicy(
cube="row_wise", pe="column_wise",
num_cubes=N_CUBES, num_pes=N_PES,
)
rt.ahbm.set_device(src_sip)
t = rt.zeros(
(N_CUBES, N_PES * N_ELEM), dtype="f16",
dp=dp, name="raw_tensor",
)
t.copy_(rt.from_numpy(
np.full((N_CUBES, N_PES * N_ELEM), 1.0, dtype=np.float16),
))
def kernel(t_ptr, n_elem, tl):
pe_id = tl.program_id(axis=0)
cube_id = tl.program_id(axis=1)
if cube_id == src_cube and pe_id == src_pe:
data = tl.load(
t_ptr + src_off, shape=(n_elem,), dtype="f16",
)
tl.store(t_ptr + dst_off, data)
pending = rt.launch(
"diag_raw_kernel", kernel, t, N_ELEM, _defer_wait=True,
)
for h, _sip, meta in pending:
rt.wait(h, _meta=meta)
# Compute the RAW sub-txn path: src PE_DMA -> dst HBM_CTRL
from kernbench.policy.address.phyaddr import PhysAddr
ctx = next(iter(engine._components.values())).ctx
src_pe_prefix = f"sip{src_sip}.cube{src_cube}.pe{src_pe}"
# Resolve dst PA to HBM controller node
# The raw store kernel issues DmaWriteCmd on dst VA; in the engine
# this is translated via PE_MMU. For diagnostic we approximate
# the destination as the dst cube's HBM controller for slice
# belonging to dst_pe.
# Use the resolver on a constructed PA matching the same memory
# slice the kernel writes to.
# The tensor is "row_wise" sharded across cubes, so each cube
# owns row[cube_id, :], with each PE owning a column slice.
# The actual dst PA depends on the AHBM allocator; we read it
# via the tensor's shard map.
shard_map = getattr(t, "_shard_map", None) or getattr(t, "shard_map", None)
# Fallback: query the resolver directly by constructing a PA in
# the dst cube's HBM region. If shard_map is unavailable, still
# show the breakdown for src-PE-DMA -> first reachable HBM_CTRL
# in dst cube.
dst_hbm_id = f"sip{dst_sip}.cube{dst_cube}.hbm_ctrl"
if dst_hbm_id not in engine._components:
# try alternate naming
for nid in engine._components.keys():
if (
nid.startswith(f"sip{dst_sip}.cube{dst_cube}.")
and "hbm" in nid
):
dst_hbm_id = nid
break
# find_path() prepends ".pe_dma" to src_pe automatically
try:
raw_path = ctx.router.find_path(src_pe_prefix, dst_hbm_id)
except Exception as e:
raw_path = []
print(f" WARN: find_path raw failed: {e}")
if not raw_path:
# Try other HBM-related node names in dst cube
for nid in engine._components.keys():
if not nid.startswith(f"sip{dst_sip}.cube{dst_cube}."):
continue
if "hbm" not in nid:
continue
try:
p = ctx.router.find_path(src_pe_prefix, nid)
except Exception:
p = []
if p:
raw_path = p
print(f" (fallback raw dst node: {nid})")
break
return captured, ctx, raw_path, engine
finally:
restore()
# ── IPCQ path scenario ───────────────────────────────────────────────
def _run_ipcq():
captured, restore = _install_per_pe_capture()
dst_pe_ipcq_id = (
f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}.pe_ipcq"
)
arrival_log, restore_arrival = _install_meta_arrival_capture(
dst_pe_ipcq_id,
)
recv_log, restore_recv = _install_recv_capture(dst_pe_ipcq_id)
barrier_log, restore_barrier = _install_barrier_capture()
try:
topo = resolve_topology(str(TOPOLOGY_PATH))
engine = GraphEngine(topo.topology_obj, enable_data=True)
spec = topo.topology_obj.spec
src_sip, src_cube, src_pe = SRC
dst_sip, dst_cube, dst_pe = DST
cfg = load_ccl_config()
merged = resolve_algorithm_config(cfg, name="intercube_allreduce")
merged["slot_size"] = max(int(merged.get("slot_size", 4096)), NBYTES)
with RuntimeContext(
engine=engine,
target_device=DeviceSelector("all"),
correlation_id="diag_ipcq",
spec=spec,
) as rt:
configure_sfr_intercube_multisip(engine, spec, merged)
dp = DPPolicy(
cube="row_wise", pe="column_wise",
num_cubes=N_CUBES, num_pes=N_PES,
)
def kernel(t_ptr, n_elem, tl):
pe_id = tl.program_id(axis=0)
cube_id = tl.program_id(axis=1)
if cube_id == src_cube and pe_id == src_pe:
data = tl.load(t_ptr, shape=(n_elem,), dtype="f16")
tl.send(dir=("E" if HOP == "h3" else "S"), src=data)
elif cube_id == dst_cube and pe_id == dst_pe:
tl.recv(
dir=("W" if HOP == "h3" else "N"),
shape=(n_elem,), dtype="f16",
)
tensors = []
for s in sorted({src_sip, dst_sip}):
rt.ahbm.set_device(s)
t = rt.zeros(
(N_CUBES, N_PES * N_ELEM), dtype="f16",
dp=dp, name=f"sip{s}",
)
t.copy_(rt.from_numpy(
np.full((N_CUBES, N_PES * N_ELEM), 1.0, dtype=np.float16),
))
tensors.append(t)
all_pending = []
for tt in tensors:
pending = rt.launch(
"diag_ipcq_kernel", kernel, tt, N_ELEM, _defer_wait=True,
)
all_pending.extend(pending)
for h, _sip, meta in all_pending:
rt.wait(h, _meta=meta)
ctx = next(iter(engine._components.values())).ctx
src_pe_prefix = f"sip{src_sip}.cube{src_cube}.pe{src_pe}"
dst_pe_dma = f"sip{dst_sip}.cube{dst_cube}.pe{dst_pe}.pe_dma"
try:
ipcq_path = ctx.router.find_path(src_pe_prefix, dst_pe_dma)
except Exception as e:
ipcq_path = []
print(f" WARN: find_path ipcq failed: {e}")
# Snapshot DST PE_IPCQ qp state at end-of-run so we can see what
# peer_head_cache/my_tail looked like (and at which directions).
qp_state = _snapshot_qp_state(engine, dst_pe_ipcq_id)
return (captured, ctx, ipcq_path, engine,
arrival_log, qp_state, recv_log, barrier_log)
finally:
restore_barrier()
restore_recv()
restore_arrival()
restore()
# ── Test entry ───────────────────────────────────────────────────────
@pytest.mark.diagnostic
def test_pe_to_pe_diagnostic_h3():
print("\n" + "=" * 78)
print(f" Diagnostic: h3 inter-cube horizontal, nbytes={NBYTES}")
print(f" src={SRC} dst={DST}")
print("=" * 78)
# ── RAW scenario
print("\n[RAW] tl.load + tl.store (sender pays both legs)")
raw_per_pe, raw_ctx, raw_path, raw_engine = _run_raw()
print(f" per-PE pe_exec_ns ({len(raw_per_pe)} entries):")
src_id = f"sip{SRC[0]}.cube{SRC[1]}.pe{SRC[2]}.pe_cpu"
dst_id = f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}.pe_cpu"
for nid in (src_id, dst_id):
if nid in raw_per_pe:
print(f" {nid:<60s} {raw_per_pe[nid]:.2f} ns <-- key PE")
nonzero = {k: v for k, v in raw_per_pe.items() if v > 0.5}
if nonzero:
print(f" other PEs with pe_exec_ns > 0.5 ns:")
for nid, v in sorted(nonzero.items(), key=lambda kv: -kv[1])[:6]:
if nid not in (src_id, dst_id):
print(f" {nid:<60s} {v:.2f} ns")
print(f" max(pe_exec_ns) = "
f"{max(raw_per_pe.values()) if raw_per_pe else 0:.2f} ns")
if raw_path:
br = _path_breakdown(raw_ctx, raw_path, NBYTES)
_print_breakdown("RAW sub-txn path (src.pe_dma -> dst.hbm_ctrl)", br)
_dump_src_op_records(raw_engine, *SRC, "RAW")
# ── IPCQ scenario
print("\n[IPCQ] tl.send + tl.recv (recv pays inbound traversal+drain)")
(ipcq_per_pe, ipcq_ctx, ipcq_path, ipcq_engine,
arrival_log, qp_state, recv_log, barrier_log) = _run_ipcq()
print(f"\n [BARRIER LOG] {len(barrier_log)} _execute_kernel entries:")
src_id = f"sip{SRC[0]}.cube{SRC[1]}.pe{SRC[2]}.pe_cpu"
dst_id = f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}.pe_cpu"
n_skipped = 0
src_entry = None
dst_entry = None
for e in barrier_log:
if e["barrier_skipped"]:
n_skipped += 1
if e["node_id"] == src_id:
src_entry = e
if e["node_id"] == dst_id:
dst_entry = e
print(f" PEs entering _execute_kernel: {len(barrier_log)}")
print(f" PEs that SKIPPED barrier (env.now > target_start): {n_skipped}")
if src_entry:
print(
f" SRC pe ({src_id}): entry_now={src_entry['entry_now']:.2f} "
f"target_start={src_entry['target_start_ns']:.2f} "
f"skipped={src_entry['barrier_skipped']} "
f"late_ns={src_entry['delta_late_ns']:.2f}"
)
if dst_entry:
print(
f" DST pe ({dst_id}): entry_now={dst_entry['entry_now']:.2f} "
f"target_start={dst_entry['target_start_ns']:.2f} "
f"skipped={dst_entry['barrier_skipped']} "
f"late_ns={dst_entry['delta_late_ns']:.2f}"
)
# Top 5 latest arrivals
sorted_late = sorted(
[e for e in barrier_log if e["delta_late_ns"] is not None],
key=lambda e: -e["delta_late_ns"],
)[:5]
print(f" Top 5 latest PE arrivals (positive = barrier missed):")
for e in sorted_late:
if e["delta_late_ns"] > 0:
print(
f" {e['node_id']}: late by {e['delta_late_ns']:.2f} ns "
f"(entry={e['entry_now']:.2f}, target={e['target_start_ns']:.2f})"
)
print(f"\n [RECV LOG on dst pe_ipcq] {len(recv_log)} entries:")
for e in recv_log:
print(
f" {e['phase']:5s} t={e['t']:8.2f} ns "
f"dir={e['direction']} "
f"peer_head_cache={e['peer_head_cache']} "
f"my_tail={e['my_tail']}"
)
print(f"\n [META-ARRIVAL LOG on dst pe_ipcq] {len(arrival_log)} arrivals:")
for i, e in enumerate(arrival_log):
print(
f" #{i:2d} t={e['t']:8.2f} ns "
f"src=(sip{e['src_sip']},cube{e['src_cube']},pe{e['src_pe']}) "
f"dir={e['src_direction']} "
f"sender_seq={e['sender_seq']} "
f"matched_dir={e['matched_direction']} "
f"nbytes={e['nbytes']}"
)
for d, ph in e["peer_head_cache_before"].items():
mt = e["my_tail_before"][d]
if ph != 0 or mt != 0 or d == e["matched_direction"]:
print(
f" before: dir={d} peer_head_cache={ph} my_tail={mt}"
)
print(f"\n [QP STATE END-OF-RUN on dst pe_ipcq]:")
for d, st in qp_state.items():
print(
f" dir={d} peer_head_cache={st['peer_head_cache']} "
f"my_tail={st['my_tail']} rx_range=[{st['rx_range'][0]}..."
f"{st['rx_range'][1]}) n_slots={st['n_slots']} "
f"slot_size={st['slot_size']}"
)
print(f" per-PE pe_exec_ns ({len(ipcq_per_pe)} entries):")
for nid in (src_id, dst_id):
if nid in ipcq_per_pe:
print(f" {nid:<60s} {ipcq_per_pe[nid]:.2f} ns <-- key PE")
nonzero = {k: v for k, v in ipcq_per_pe.items() if v > 0.5}
if nonzero:
print(f" other PEs with pe_exec_ns > 0.5 ns:")
for nid, v in sorted(nonzero.items(), key=lambda kv: -kv[1])[:6]:
if nid not in (src_id, dst_id):
print(f" {nid:<60s} {v:.2f} ns")
print(f" max(pe_exec_ns) = "
f"{max(ipcq_per_pe.values()) if ipcq_per_pe else 0:.2f} ns")
if ipcq_path:
br = _path_breakdown(ipcq_ctx, ipcq_path, NBYTES)
_print_breakdown("IPCQ sub-txn path (src.pe_dma -> peer.pe_dma)", br)
_dump_src_op_records(ipcq_engine, *SRC, "IPCQ")
_dump_src_op_records(ipcq_engine, *DST, "IPCQ DST")
# ── Credit-return path analysis (where the missing IPCQ "ack" lives)
print("\n" + "-" * 78)
print("Credit-return path (current modeling)")
print("-" * 78)
src_pe_prefix = f"sip{SRC[0]}.cube{SRC[1]}.pe{SRC[2]}"
dst_pe_prefix = f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}"
# PE_IPCQ._credit_latency_ns calls
# ctx.router.find_path(self._pe_prefix, peer_pe_prefix)
# where the *destination* lacks the ".pe_dma" suffix. find_path()
# only auto-appends to the source, so this raises -> the except
# clause silently returns 0.0. Effectively credit latency = 0.
try:
ipcq_ctx.router.find_path(dst_pe_prefix, src_pe_prefix)
bug_caught = False
except Exception as e:
bug_caught = True
print(f" CONFIRMED BUG in _credit_latency_ns: dest lacks '.pe_dma' "
f"-> find_path raises -> caught exception -> returns 0.0")
print(f" Error: {e}")
# The intended credit path is recv -> sender (reverse data direction)
try:
credit_path = ipcq_ctx.router.find_path(
dst_pe_prefix, f"{src_pe_prefix}.pe_dma",
)
except Exception as e:
credit_path = []
print(f" WARN: corrected find_path credit failed: {e}")
if credit_path:
credit_size = 16 # PE_IPCQ default _credit_size_bytes
# Today's modeling: drain only, 16 bytes -> ~0.125 ns
cur = ipcq_ctx.compute_drain_ns(credit_path, credit_size)
# Proposed modeling: full path latency (edges + node overhead + drain)
proposed = ipcq_ctx.compute_path_latency_ns(credit_path, credit_size)
print(f" credit path nodes = {len(credit_path)} (recv -> sender)")
for nid in credit_path[:6]:
print(f" {nid}")
if len(credit_path) > 6:
print(f" ... {len(credit_path) - 6} more nodes")
br = _path_breakdown(ipcq_ctx, credit_path, credit_size)
print(f" edge_total_ns = {br['edge_total_ns']:.2f}")
print(f" overhead_total_ns = {br['overhead_total_ns']:.2f}")
print(f" drain_ns(16 bytes) = {br['drain_ns']:.2f}")
print(f" CURRENT _credit_latency_ns (drain only) = {cur:.3f} ns")
print(f" PROPOSED (compute_path_latency_ns) = {proposed:.2f} ns")
print(f" delta = {proposed - cur:+.2f} ns")
# ── Comparison summary
print("\n" + "-" * 78)
print("Summary")
print("-" * 78)
raw_max = max(raw_per_pe.values()) if raw_per_pe else 0.0
ipcq_max = max(ipcq_per_pe.values()) if ipcq_per_pe else 0.0
print(f" RAW max(pe_exec_ns) = {raw_max:.2f} ns")
print(f" IPCQ max(pe_exec_ns) (current) = {ipcq_max:.2f} ns")
print(f" delta (RAW - IPCQ current) = {raw_max - ipcq_max:+.2f} ns")
if credit_path:
ipcq_with_credit = ipcq_max + (proposed - cur)
print(
f" IPCQ projected w/ blocking credit + full path overhead "
f"= {ipcq_with_credit:.2f} ns"
)
print(
f" delta (RAW - IPCQ projected) = "
f"{raw_max - ipcq_with_credit:+.2f} ns "
f"(<= 0 means IPCQ >= RAW)"
)
# No assertions — this is observational.
assert raw_per_pe, "no RAW pe_exec_ns recorded"
assert ipcq_per_pe, "no IPCQ pe_exec_ns recorded"
+347
View File
@@ -0,0 +1,347 @@
"""PE-to-PE latency sweep across hop types and data sizes.
Compares IPCQ send/recv vs raw-DMA (tl.load + tl.store) latency for four
hop types:
H1 Intra-cube horizontal pe0 pe1
H2 Intra-cube vertical pe0 pe4
H3 Inter-cube horizontal sip0.cube0.pe0 sip0.cube1.pe0
H4 Inter-cube vertical sip0.cube0.pe0 sip0.cube4.pe0
Sizes: 128..10240 bytes. Emits PNGs with both lines plus a CSV.
"""
from __future__ import annotations
import csv
from dataclasses import dataclass
from pathlib import Path
import numpy as np
import pytest
from kernbench.ccl.install import load_ccl_config, resolve_algorithm_config
from kernbench.ccl.sfr_config import configure_sfr_intercube_multisip
from kernbench.policy.placement.dp import DPPolicy
from kernbench.runtime_api.context import RuntimeContext
from kernbench.runtime_api.types import DeviceSelector
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import resolve_topology
TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
PLOT_DIR = Path(__file__).parent / "pe2pe_latency_plots"
SIZES = [128, 256, 384, 512, 768, 1024, 2048, 4096, 8192, 10240]
N_CUBES = 16
N_PES = 8
ELEM_BYTES = 2 # f16
@dataclass(frozen=True)
class Hop:
id: str
label: str
src: tuple[int, int, int]
dst: tuple[int, int, int]
send_dir: str
recv_dir: str
supports_raw: bool
HOPS = [
Hop("h1_intra_horizontal", "Intra-cube horizontal (pe0 to pe1)",
(0, 0, 0), (0, 0, 1), "intra_E", "intra_W", True),
Hop("h2_intra_vertical", "Intra-cube vertical (pe0 to pe4)",
(0, 0, 0), (0, 0, 4), "intra_S", "intra_N", True),
Hop("h3_inter_cube_horizontal", "Inter-cube horizontal (cube0 to cube1)",
(0, 0, 0), (0, 1, 0), "E", "W", True),
Hop("h4_inter_cube_vertical", "Inter-cube vertical (cube0 to cube4)",
(0, 0, 0), (0, 4, 0), "S", "N", True),
]
def _make_engine():
topo = resolve_topology(str(TOPOLOGY_PATH))
engine = GraphEngine(topo.topology_obj, enable_data=True)
return engine, topo.topology_obj.spec
# ── IPCQ path ────────────────────────────────────────────────────────
def _measure_ipcq(hop: Hop, nbytes: int) -> float:
engine, spec = _make_engine()
cfg = load_ccl_config()
merged = resolve_algorithm_config(cfg, name="intercube_allreduce")
merged["slot_size"] = max(int(merged.get("slot_size", 4096)), nbytes)
n_elem = nbytes // ELEM_BYTES
src_sip, src_cube, src_pe = hop.src
dst_sip, dst_cube, dst_pe = hop.dst
send_dir, recv_dir = hop.send_dir, hop.recv_dir
with RuntimeContext(
engine=engine,
target_device=DeviceSelector("all"),
correlation_id=f"ipcq_{hop.id}_{nbytes}",
spec=spec,
) as ctx:
configure_sfr_intercube_multisip(engine, spec, merged)
dp = DPPolicy(
cube="row_wise", pe="column_wise",
num_cubes=N_CUBES, num_pes=N_PES,
)
def kernel(t_ptr, n_elem, tl):
pe_id = tl.program_id(axis=0)
cube_id = tl.program_id(axis=1)
if cube_id == src_cube and pe_id == src_pe:
data = tl.load(t_ptr, shape=(n_elem,), dtype="f16")
tl.send(dir=send_dir, src=data)
elif cube_id == dst_cube and pe_id == dst_pe:
tl.recv(dir=recv_dir, shape=(n_elem,), dtype="f16")
tensors = []
for s in sorted({src_sip, dst_sip}):
ctx.ahbm.set_device(s)
t = ctx.zeros(
(N_CUBES, N_PES * n_elem), dtype="f16",
dp=dp, name=f"sip{s}",
)
t.copy_(ctx.from_numpy(
np.full((N_CUBES, N_PES * n_elem), 1.0, dtype=np.float16),
))
tensors.append(t)
all_pending = []
for t in tensors:
pending = ctx.launch(
f"{hop.id}_ipcq", kernel, t, n_elem, _defer_wait=True,
)
all_pending.extend(pending)
for h, sip_id, meta in all_pending:
ctx.wait(h, _meta=meta)
# Per-PE kernel execution time (excludes launch dispatch and
# response aggregation). IPCQ: DST blocks on tl.recv until the
# send arrives, so max across SIPs = DST's transfer time.
pe_exec_vals = []
for h, _sip, _meta in all_pending:
_, trace = engine.get_completion(h)
if trace and trace.get("pe_exec_ns") is not None:
pe_exec_vals.append(float(trace["pe_exec_ns"]))
return max(pe_exec_vals) if pe_exec_vals else 0.0
# ── Raw DMA path (intra-SIP only) ────────────────────────────────────
def _measure_raw(hop: Hop, nbytes: int) -> float:
"""tl.load from source slice + tl.store to destination slice. The VA
mapping spans the cube mesh within one SIP (MmuMapMsg broadcasts to all
cubes of the SIP), so the store goes through the fabric to the
destination PE's HBM. No IPCQ protocol involved.
"""
if not hop.supports_raw:
raise RuntimeError(f"hop {hop.id} does not support raw path")
engine, spec = _make_engine()
n_elem = nbytes // ELEM_BYTES
src_sip, src_cube, src_pe = hop.src
dst_sip, dst_cube, dst_pe = hop.dst
assert src_sip == dst_sip
# Slice offsets in the (N_CUBES, N_PES * n_elem) tensor:
# row = cube, slice within row = pe * n_elem .. (pe+1)*n_elem
# Byte offsets from va_base:
src_off = (src_cube * N_PES + src_pe) * n_elem * ELEM_BYTES
dst_off = (dst_cube * N_PES + dst_pe) * n_elem * ELEM_BYTES
with RuntimeContext(
engine=engine,
target_device=DeviceSelector("all"),
correlation_id=f"raw_{hop.id}_{nbytes}",
spec=spec,
) as ctx:
dp = DPPolicy(
cube="row_wise", pe="column_wise",
num_cubes=N_CUBES, num_pes=N_PES,
)
ctx.ahbm.set_device(src_sip)
t = ctx.zeros(
(N_CUBES, N_PES * n_elem), dtype="f16",
dp=dp, name="raw_tensor",
)
t.copy_(ctx.from_numpy(
np.full((N_CUBES, N_PES * n_elem), 1.0, dtype=np.float16),
))
def kernel(t_ptr, n_elem, tl):
pe_id = tl.program_id(axis=0)
cube_id = tl.program_id(axis=1)
if cube_id == src_cube and pe_id == src_pe:
data = tl.load(
t_ptr + src_off, shape=(n_elem,), dtype="f16",
)
tl.store(t_ptr + dst_off, data)
pending = ctx.launch(
f"{hop.id}_raw", kernel, t, n_elem, _defer_wait=True,
)
for h, sip_id, meta in pending:
ctx.wait(h, _meta=meta)
# Per-PE kernel execution time. Raw: only SRC does real work
# (tl.load + tl.store, store is blocking), so max across all PEs
# = SRC's transfer time. Idle PEs contribute only overhead_ns.
pe_exec_vals = []
for h, _sip, _meta in pending:
_, trace = engine.get_completion(h)
if trace and trace.get("pe_exec_ns") is not None:
pe_exec_vals.append(float(trace["pe_exec_ns"]))
return max(pe_exec_vals) if pe_exec_vals else 0.0
# ── CSV + plotting ───────────────────────────────────────────────────
def _write_csv(records, path: Path) -> None:
path.parent.mkdir(parents=True, exist_ok=True)
with open(path, "w", newline="", encoding="utf-8") as f:
w = csv.DictWriter(
f, fieldnames=["hop", "label", "size_bytes", "path", "total_ns"],
)
w.writeheader()
for r in records:
w.writerow(r)
def _plot_per_hop(records, hop: Hop, path: Path) -> None:
import matplotlib.pyplot as plt
ipcq = sorted(
[r for r in records if r["hop"] == hop.id and r["path"] == "ipcq"],
key=lambda r: r["size_bytes"],
)
raw = sorted(
[r for r in records if r["hop"] == hop.id and r["path"] == "raw"],
key=lambda r: r["size_bytes"],
)
fig, ax = plt.subplots(figsize=(8, 5))
if ipcq:
ax.plot(
[r["size_bytes"] for r in ipcq],
[r["total_ns"] for r in ipcq],
marker="o", label="IPCQ (send/recv)", color="tab:blue",
)
if raw:
ax.plot(
[r["size_bytes"] for r in raw],
[r["total_ns"] for r in raw],
marker="s", label="Raw DMA (load+store)", color="tab:orange",
)
ax.set_xlabel("Data size (bytes)")
ax.set_ylabel("Latency (ns)")
ax.set_title(hop.label)
ax.grid(True, alpha=0.3)
ax.legend()
fig.tight_layout()
fig.savefig(path, dpi=120)
plt.close(fig)
def _plot_overview(records, path: Path) -> None:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 2, figsize=(13, 9))
axes = axes.flatten()
for i, hop in enumerate(HOPS):
ax = axes[i]
ipcq = sorted(
[r for r in records if r["hop"] == hop.id and r["path"] == "ipcq"],
key=lambda r: r["size_bytes"],
)
raw = sorted(
[r for r in records if r["hop"] == hop.id and r["path"] == "raw"],
key=lambda r: r["size_bytes"],
)
if ipcq:
ax.plot(
[r["size_bytes"] for r in ipcq],
[r["total_ns"] for r in ipcq],
marker="o", label="IPCQ", color="tab:blue",
)
if raw:
ax.plot(
[r["size_bytes"] for r in raw],
[r["total_ns"] for r in raw],
marker="s", label="Raw", color="tab:orange",
)
ax.set_title(hop.label, fontsize=10)
ax.set_xlabel("bytes")
ax.set_ylabel("ns")
ax.grid(True, alpha=0.3)
ax.legend(fontsize=8)
for j in range(len(HOPS), len(axes)):
axes[j].axis("off")
fig.suptitle(
"PE-to-PE latency: IPCQ vs raw DMA",
fontsize=14,
)
fig.tight_layout()
fig.savefig(path, dpi=120)
plt.close(fig)
# ── Test entry ───────────────────────────────────────────────────────
def test_pe_to_pe_latency_sweep():
records: list[dict] = []
for hop in HOPS:
for size in SIZES:
# IPCQ path
ipcq_ns = _measure_ipcq(hop, size)
records.append({
"hop": hop.id, "label": hop.label,
"size_bytes": size, "path": "ipcq",
"total_ns": ipcq_ns,
})
raw_s = "n/a"
if hop.supports_raw:
raw_ns = _measure_raw(hop, size)
records.append({
"hop": hop.id, "label": hop.label,
"size_bytes": size, "path": "raw",
"total_ns": raw_ns,
})
raw_s = f"{raw_ns:7.1f}ns"
print(
f"[{hop.id}] size={size:5d} "
f"ipcq={ipcq_ns:7.1f}ns raw={raw_s}"
)
PLOT_DIR.mkdir(parents=True, exist_ok=True)
_write_csv(records, PLOT_DIR / "summary.csv")
for hop in HOPS:
_plot_per_hop(records, hop, PLOT_DIR / f"{hop.id}.png")
_plot_overview(records, PLOT_DIR / "overview.png")
for hop in HOPS:
rs = sorted(
[r for r in records if r["hop"] == hop.id and r["path"] == "ipcq"],
key=lambda r: r["size_bytes"],
)
for r in rs:
assert r["total_ns"] > 0, f"{hop.id}: total_ns must be > 0"
print(f"\n Plots + CSV written to {PLOT_DIR}")
+182 -62
View File
@@ -1,7 +1,10 @@
import pytest
from kernbench.policy.address.allocator import AddressConfig, AllocationError, PEMemAllocator
from kernbench.policy.address.phyaddr import PhysAddr, PhysAddrError, UnitType
from kernbench.policy.address.phyaddr import (
PhysAddr, PhysAddrError, UnitType,
PESubUnit, MCPUSubUnit, IOCPUSubUnit,
)
_MB = 1 << 20
_GB = 1 << 30
@@ -23,13 +26,11 @@ _CFG = AddressConfig(
def test_physaddr_immutable():
pa = PhysAddr.hbm_addr(rack_id=0, sip_id=0, cube_id=0, hbm_offset=0)
pa = PhysAddr.hbm_addr(sip_id=0, die_id=0, hbm_offset=0)
with pytest.raises(AttributeError):
pa.rack_id = 1 # type: ignore[misc]
# hashable
{pa}
# comparable
pa2 = PhysAddr.hbm_addr(rack_id=0, sip_id=0, cube_id=0, hbm_offset=0)
pa.sip_id = 1 # type: ignore[misc]
{pa} # hashable
pa2 = PhysAddr.hbm_addr(sip_id=0, die_id=0, hbm_offset=0)
assert pa == pa2
@@ -37,120 +38,133 @@ def test_physaddr_immutable():
def test_hbm_encode_decode_roundtrip():
pa = PhysAddr.hbm_addr(rack_id=2, sip_id=3, cube_id=5, hbm_offset=0x1000)
pa = PhysAddr.hbm_addr(sip_id=3, die_id=5, hbm_offset=0x1000)
raw = pa.encode()
dec = PhysAddr.decode(raw)
assert dec.rack_id == 2
assert dec.sip_id == 3
assert dec.cube_id == 5
assert dec.die_id == 5
assert dec.kind == "hbm"
assert dec.hbm_offset == 0x1000
# ── PE resource encode/decode roundtrip ─────────────────────────────
# ── PE resource encode/decode roundtrip (new layout) ───────────────
def test_pe_resource_encode_decode_roundtrip():
pa = PhysAddr(
rack_id=1, sip_id=2, sip_seg=7, local_offset=0,
kind="pe_resource", cube_id=7,
unit_type=UnitType.PE, pe_id=3, ext=1, sub_offset=0xFF,
pa = PhysAddr.pe_resource_addr(
sip_id=2, die_id=7, pe_id=3,
pe_sub_unit=PESubUnit.PE_TCM, sub_offset=0xFF,
)
# manually build local_offset matching bit layout
local_offset = (UnitType.PE << 34) | (3 << 30) | (1 << 29) | 0xFF
pa2 = PhysAddr(
rack_id=1, sip_id=2, sip_seg=7, local_offset=local_offset,
kind="pe_resource", cube_id=7,
unit_type=UnitType.PE, pe_id=3, ext=1, sub_offset=0xFF,
)
raw = pa2.encode()
raw = pa.encode()
dec = PhysAddr.decode(raw)
assert dec.kind == "pe_resource"
assert dec.unit_type == UnitType.PE
assert dec.pe_id == 3
assert dec.ext == 1
assert dec.pe_sub_unit == PESubUnit.PE_TCM
assert dec.sub_offset == 0xFF
assert dec.die_id == 7
assert dec.sip_id == 2
def test_pe_resource_all_sub_units():
"""Each PE sub-unit roundtrips correctly."""
for su in PESubUnit:
pa = PhysAddr.pe_resource_addr(
sip_id=0, die_id=0, pe_id=0,
pe_sub_unit=su, sub_offset=42,
)
dec = PhysAddr.decode(pa.encode())
assert dec.pe_sub_unit == su
assert dec.sub_offset == 42
# ── pe_hbm_addr factory ────────────────────────────────────────────
def test_pe_hbm_addr_factory():
SLICE = 6 * (1 << 30) # 6 GB per PE slice
SLICE = 6 * _GB
pa = PhysAddr.pe_hbm_addr(
rack_id=0, sip_id=0, cube_id=0,
sip_id=0, die_id=0,
pe_id=2, pe_local_hbm_offset=1024, slice_size_bytes=SLICE,
)
assert pa.kind == "hbm"
assert pa.cube_id == 0
assert pa.die_id == 0
assert pa.hbm_offset == 2 * SLICE + 1024
def test_pe_hbm_addr_overflow():
SLICE = 6 * (1 << 30)
SLICE = 6 * _GB
with pytest.raises(PhysAddrError, match="pe_local_hbm_offset"):
PhysAddr.pe_hbm_addr(
rack_id=0, sip_id=0, cube_id=0,
sip_id=0, die_id=0,
pe_id=0, pe_local_hbm_offset=SLICE, slice_size_bytes=SLICE,
)
# ── Invalid unit_type decode (fix #1) ──────────────────────────────
# ── Invalid resource_kind decode ──────────────────────────────────
def test_invalid_unit_type_raises():
# Craft a PE-resource address with unit_type=7 (invalid)
local_offset = (7 << 34) | (0 << 30) | 0
pa_raw = PhysAddr(
rack_id=0, sip_id=0, sip_seg=0, local_offset=local_offset,
)
def test_invalid_resource_kind_raises():
# resource_kind=7 (invalid), addr_space=0
local_offset = (7 << 34) | 0
pa_raw = PhysAddr(sip_id=0, die_id=0, local_offset=local_offset)
raw = pa_raw.encode()
with pytest.raises(PhysAddrError, match="unit_type"):
with pytest.raises(PhysAddrError, match="resource_kind"):
PhysAddr.decode(raw)
# ── hbm_pe_id utility (fix #3) ─────────────────────────────────────
# ── hbm_pe_id utility ─────────────────────────────────────────────
def test_hbm_pe_id_utility():
SLICE = 6 * (1 << 30) # 6 GB
SLICE = 6 * _GB
pa = PhysAddr.pe_hbm_addr(
rack_id=0, sip_id=0, cube_id=0,
sip_id=0, die_id=0,
pe_id=5, pe_local_hbm_offset=256, slice_size_bytes=SLICE,
)
assert PhysAddr.hbm_pe_id(pa.hbm_offset, SLICE) == 5
# ── UnitType.SRAM exists (fix #5) ──────────────────────────────────
# ── UnitType / sub-unit enums ──────────────────────────────────────
def test_sram_unit_type_exists():
assert UnitType.SRAM == 2
def test_pe_sub_unit_enum():
assert PESubUnit.PE_TCM == 6
assert PESubUnit.IPCQ == 2
def test_mcpu_sub_unit_enum():
assert MCPUSubUnit.MCPU_SRAM == 5
def test_iocpu_sub_unit_enum():
assert IOCPUSubUnit.IO_SRAM == 5
# ── cube_sram_addr factory + roundtrip ──────────────────────────────
def test_cube_sram_addr_roundtrip():
pa = PhysAddr.cube_sram_addr(
rack_id=0, sip_id=1, cube_id=3, sram_offset=0x800,
)
pa = PhysAddr.cube_sram_addr(sip_id=1, die_id=3, sram_offset=0x800)
assert pa.kind == "pe_resource"
assert pa.unit_type == UnitType.SRAM
assert pa.cube_id == 3
assert pa.die_id == 3
assert pa.sub_offset == 0x800
# encode → decode roundtrip
dec = PhysAddr.decode(pa.encode())
assert dec.unit_type == UnitType.SRAM
assert dec.cube_id == 3
assert dec.die_id == 3
assert dec.sub_offset == 0x800
def test_cube_sram_addr_range_check():
with pytest.raises(PhysAddrError):
PhysAddr.cube_sram_addr(
rack_id=0, sip_id=0, cube_id=0,
sram_offset=(1 << 29), # exceeds 29-bit sub_offset
sip_id=0, die_id=0,
sram_offset=(1 << 25), # exceeds 25-bit sub_offset
)
@@ -158,29 +172,137 @@ def test_cube_sram_addr_range_check():
def test_pe_tcm_addr_roundtrip():
pa = PhysAddr.pe_tcm_addr(
rack_id=0, sip_id=0, cube_id=2, pe_id=7, tcm_offset=0x400,
)
pa = PhysAddr.pe_tcm_addr(sip_id=0, die_id=2, pe_id=7, tcm_offset=0x400)
assert pa.kind == "pe_resource"
assert pa.unit_type == UnitType.PE
assert pa.pe_id == 7
assert pa.cube_id == 2
assert pa.die_id == 2
assert pa.pe_sub_unit == PESubUnit.PE_TCM
assert pa.sub_offset == 0x400
# encode → decode roundtrip
dec = PhysAddr.decode(pa.encode())
assert dec.unit_type == UnitType.PE
assert dec.pe_id == 7
assert dec.pe_sub_unit == PESubUnit.PE_TCM
assert dec.sub_offset == 0x400
def test_pe_tcm_addr_range_check():
with pytest.raises(PhysAddrError):
PhysAddr.pe_tcm_addr(
rack_id=0, sip_id=0, cube_id=0, pe_id=0,
tcm_offset=(1 << 29), # exceeds 29-bit sub_offset
sip_id=0, die_id=0, pe_id=0,
tcm_offset=(1 << 25), # exceeds 25-bit sub_offset
)
# ── MCPU resource factory + roundtrip ──────────────────────────────
def test_mcpu_resource_roundtrip():
pa = PhysAddr.mcpu_resource_addr(
sip_id=0, die_id=1,
mcpu_sub_unit=MCPUSubUnit.MCPU_SRAM, sub_offset=0x100,
)
assert pa.kind == "pe_resource"
assert pa.unit_type == UnitType.MCPU
assert pa.mcpu_sub_unit == MCPUSubUnit.MCPU_SRAM
assert pa.sub_offset == 0x100
dec = PhysAddr.decode(pa.encode())
assert dec.unit_type == UnitType.MCPU
assert dec.mcpu_sub_unit == MCPUSubUnit.MCPU_SRAM
assert dec.sub_offset == 0x100
# ── IOCHIPLET: IOCPU factory + roundtrip ────────────────────────────
def test_iocpu_resource_roundtrip():
pa = PhysAddr.iocpu_resource_addr(
sip_id=1, die_id=17,
iocpu_sub_unit=IOCPUSubUnit.IPCQ, sub_offset=0x20000,
)
assert pa.kind == "iocpu"
assert pa.iocpu_sub_unit == IOCPUSubUnit.IPCQ
assert pa.sub_offset == 0x20000
dec = PhysAddr.decode(pa.encode())
assert dec.kind == "iocpu"
assert dec.iocpu_sub_unit == IOCPUSubUnit.IPCQ
assert dec.sub_offset == 0x20000
assert dec.die_id == 17
def test_iocpu_die_range_check():
with pytest.raises(PhysAddrError, match="IOCHIPLET"):
PhysAddr.iocpu_resource_addr(
sip_id=0, die_id=5, # not a chiplet die
iocpu_sub_unit=0, sub_offset=0,
)
# ── IOCHIPLET: UAL factory + roundtrip ──────────────────────────────
def test_ual_addr_roundtrip():
pa = PhysAddr.ual_addr(sip_id=0, die_id=16, ual_offset=0x1000)
assert pa.kind == "ual"
dec = PhysAddr.decode(pa.encode())
assert dec.kind == "ual"
assert dec.die_id == 16
assert dec.chiplet_offset >= (1 << 31) # >= 2 GB boundary
# ── die_id dispatch ────────────────────────────────────────────────
def test_die_id_ahbm_range():
for die in [0, 15]:
pa = PhysAddr.hbm_addr(sip_id=0, die_id=die, hbm_offset=0)
dec = PhysAddr.decode(pa.encode())
assert dec.kind == "hbm"
assert dec.die_id == die
def test_die_id_chiplet_range():
for die in [16, 20]:
pa = PhysAddr.iocpu_resource_addr(
sip_id=0, die_id=die,
iocpu_sub_unit=0, sub_offset=0,
)
dec = PhysAddr.decode(pa.encode())
assert dec.kind == "iocpu"
assert dec.die_id == die
def test_die_id_reserved_raises():
raw = (0 << 47) | (21 << 42) | 0 # die_id=21 (reserved)
with pytest.raises(PhysAddrError, match="reserved"):
PhysAddr.decode(raw)
# ── Boundary values ────────────────────────────────────────────────
def test_sip_boundary():
pa = PhysAddr.hbm_addr(sip_id=15, die_id=0, hbm_offset=0)
dec = PhysAddr.decode(pa.encode())
assert dec.sip_id == 15
def test_mbz_enforcement_ahbm():
"""AHBM local_offset bits [41:38] must be zero."""
local_offset = (1 << 38) | (1 << 37) # MBZ bit set + HBM
pa = PhysAddr(sip_id=0, die_id=0, local_offset=local_offset)
with pytest.raises(PhysAddrError, match="bits \\[41:38\\]"):
pa.encode()
def test_mbz_enforcement_chiplet():
"""IOCHIPLET local_offset bits [41:40] must be zero."""
local_offset = (1 << 40) | 0 # MBZ bit set
pa = PhysAddr(sip_id=0, die_id=16, local_offset=local_offset)
with pytest.raises(PhysAddrError, match="bits \\[41:40\\]"):
pa.encode()
# ── AddressConfig ───────────────────────────────────────────────────
@@ -193,7 +315,7 @@ def test_address_config_derived_sizes():
def _make_alloc(pe_id: int = 0) -> PEMemAllocator:
return PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=pe_id, cfg=_CFG)
return PEMemAllocator(sip_id=0, die_id=0, pe_id=pe_id, cfg=_CFG)
def test_allocator_hbm_basic():
@@ -201,8 +323,7 @@ def test_allocator_hbm_basic():
pa = a.alloc_hbm(4096)
assert pa.kind == "hbm"
assert pa.sip_id == 0
assert pa.cube_id == 0
# hbm_offset should be pe3's slice start
assert pa.die_id == 0
assert pa.hbm_offset == 3 * 6 * _GB
@@ -210,8 +331,8 @@ def test_allocator_hbm_sequential():
a = _make_alloc()
pa1 = a.alloc_hbm(1024)
pa2 = a.alloc_hbm(2048)
assert pa1.hbm_offset == 0 # pe0 slice start + 0
assert pa2.hbm_offset == 1024 # pe0 slice start + 1024
assert pa1.hbm_offset == 0
assert pa2.hbm_offset == 1024
def test_allocator_hbm_overflow():
@@ -235,7 +356,6 @@ def test_allocator_tcm_basic():
def test_allocator_tcm_respects_reserved():
a = _make_alloc()
# allocatable = 12 MB, should succeed
a.alloc_tcm(12 * _MB)
assert a.tcm_used == 12 * _MB
assert a.tcm_total == 12 * _MB
+1 -1
View File
@@ -21,7 +21,7 @@ def _engine():
def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
slice_bytes = 48 * (1 << 30) // 8
pa = PhysAddr.pe_hbm_addr(
rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
sip_id=sip, die_id=cube, pe_id=pe_id,
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
)
return pa.encode()
+15 -15
View File
@@ -20,7 +20,7 @@ def test_resolve_hbm_addr():
"""HBM address -> sip{S}.cube{C}.hbm_ctrl (single controller per cube)."""
g = _graph()
resolver = AddressResolver(g)
pa = PhysAddr.hbm_addr(rack_id=0, sip_id=0, cube_id=3, hbm_offset=0x1000)
pa = PhysAddr.hbm_addr(sip_id=0, die_id=3, hbm_offset=0x1000)
assert resolver.resolve(pa) == "sip0.cube3.hbm_ctrl"
@@ -28,33 +28,33 @@ def test_resolve_hbm_addr_high_offset():
"""HBM address with large offset still resolves to same hbm_ctrl."""
g = _graph()
resolver = AddressResolver(g)
pa = PhysAddr.hbm_addr(rack_id=0, sip_id=0, cube_id=0, hbm_offset=0x600000000)
pa = PhysAddr.hbm_addr(sip_id=0, die_id=0, hbm_offset=0x600000000)
assert resolver.resolve(pa) == "sip0.cube0.hbm_ctrl"
def test_resolve_pe_tcm_addr():
"""PE TCM address sip{S}.cube{C}.pe{P}.pe_tcm"""
"""PE TCM address -> sip{S}.cube{C}.pe{P}.pe_tcm"""
g = _graph()
resolver = AddressResolver(g)
pa = PhysAddr.pe_tcm_addr(rack_id=0, sip_id=1, cube_id=5, pe_id=7, tcm_offset=0x400)
pa = PhysAddr.pe_tcm_addr(sip_id=1, die_id=5, pe_id=7, tcm_offset=0x400)
assert resolver.resolve(pa) == "sip1.cube5.pe7.pe_tcm"
def test_resolve_sram_addr():
"""SRAM address sip{S}.cube{C}.sram"""
"""SRAM address -> sip{S}.cube{C}.sram"""
g = _graph()
resolver = AddressResolver(g)
pa = PhysAddr.cube_sram_addr(rack_id=0, sip_id=0, cube_id=10, sram_offset=0x800)
pa = PhysAddr.cube_sram_addr(sip_id=0, die_id=10, sram_offset=0x800)
assert resolver.resolve(pa) == "sip0.cube10.sram"
def test_resolve_mcpu_addr():
"""MCPU pe_resource address sip{S}.cube{C}.m_cpu"""
"""MCPU pe_resource address -> sip{S}.cube{C}.m_cpu"""
g = _graph()
resolver = AddressResolver(g)
pa = PhysAddr(
rack_id=0, sip_id=0, sip_seg=2, local_offset=(UnitType.MCPU << 34),
kind="pe_resource", cube_id=2, unit_type=UnitType.MCPU,
pa = PhysAddr.mcpu_resource_addr(
sip_id=0, die_id=2,
mcpu_sub_unit=0, sub_offset=0,
)
assert resolver.resolve(pa) == "sip0.cube2.m_cpu"
@@ -64,7 +64,7 @@ def test_resolve_nonexistent_node():
g = _graph()
resolver = AddressResolver(g)
# sip_id=15 doesn't exist in the 2-SIP topology
pa = PhysAddr.hbm_addr(rack_id=0, sip_id=15, cube_id=0, hbm_offset=0)
pa = PhysAddr.hbm_addr(sip_id=15, die_id=0, hbm_offset=0)
with pytest.raises(RoutingError):
resolver.resolve(pa)
@@ -73,7 +73,7 @@ def test_resolve_nonexistent_node():
def test_path_local_hbm():
"""PE0 -> hbm_ctrl: pe_dma router hbm_ctrl (through router mesh)."""
"""PE0 -> hbm_ctrl: pe_dma -> router -> hbm_ctrl (through router mesh)."""
g = _graph()
router = PathRouter(g)
path = router.find_path("sip0.cube0.pe0", "sip0.cube0.hbm_ctrl")
@@ -107,7 +107,7 @@ def test_all_pe_hbm_equidistant():
"""All PEs in a cube have equal routing distance to hbm_ctrl.
With n_to_one mapping and high routing weight on HBM edges,
all PEhbm_ctrl paths have the same accumulated distance.
all PE->hbm_ctrl paths have the same accumulated distance.
"""
g = _graph()
router = PathRouter(g)
@@ -151,7 +151,7 @@ def test_path_remote_cube_hbm():
def test_path_sram_via_router_mesh():
"""PE SRAM must go through router mesh nodes."""
"""PE -> SRAM must go through router mesh nodes."""
g = _graph()
router = PathRouter(g)
path = router.find_path("sip0.cube0.pe0", "sip0.cube0.sram")
@@ -168,7 +168,7 @@ def test_path_sram_via_router_mesh():
def test_path_local_tcm():
"""PE0 own TCM is PE-internal, not via router mesh."""
"""PE0 -> own TCM is PE-internal, not via router mesh."""
g = _graph()
router = PathRouter(g)
path = router.find_path("sip0.cube0.pe0", "sip0.cube0.pe0.pe_tcm")
+106
View File
@@ -0,0 +1,106 @@
"""Rectangular (non-square) SIP-level 2D topology support.
Phase 1 regression target: today the 2D builtin topology functions in
``kernbench.ccl.topologies`` (``mesh_2d``, ``torus_2d``,
``mesh_2d_no_wrap``) hardcode ``side = sqrt(world_size)`` and raise
``ValueError`` for any non-square ``world_size``. This blocks running
the allreduce sweep at n_sips=6 on torus/mesh layouts.
Phase 2 will extend these functions to accept optional ``w, h`` kwargs
so a 2×3 (or 3×2, etc.) layout works. Until then, every test below is
expected to FAIL.
Layout convention used here (matches non-rectangular case):
rank = row * w + col for 0 <= row < h, 0 <= col < w
For w=2, h=3, world_size=6 the layout is:
col=0 col=1
row=0: 0 1
row=1: 2 3
row=2: 4 5
"""
from __future__ import annotations
import pytest
from kernbench.ccl.topologies import (
mesh_2d,
mesh_2d_no_wrap,
torus_2d,
)
# ── mesh_2d_no_wrap (no wrap-around) ──────────────────────────────────
def test_mesh_2d_no_wrap_2x3_top_left():
"""rank 0 (top-left, no N, no W): only S and E."""
nbrs = mesh_2d_no_wrap(rank=0, world_size=6, w=2, h=3)
assert nbrs == {"S": 2, "E": 1}, nbrs
def test_mesh_2d_no_wrap_2x3_top_right():
"""rank 1 (top-right, no N, no E): only S and W."""
nbrs = mesh_2d_no_wrap(rank=1, world_size=6, w=2, h=3)
assert nbrs == {"S": 3, "W": 0}, nbrs
def test_mesh_2d_no_wrap_2x3_middle_left():
"""rank 2 (middle-left, no W): N, S, E."""
nbrs = mesh_2d_no_wrap(rank=2, world_size=6, w=2, h=3)
assert nbrs == {"N": 0, "S": 4, "E": 3}, nbrs
def test_mesh_2d_no_wrap_2x3_bottom_right():
"""rank 5 (bottom-right, no S, no E): only N and W."""
nbrs = mesh_2d_no_wrap(rank=5, world_size=6, w=2, h=3)
assert nbrs == {"N": 3, "W": 4}, nbrs
# ── torus_2d (wrap-around on all four edges) ─────────────────────────
def test_torus_2d_2x3_top_left():
"""rank 0: N wraps to row 2 col 0 (rank 4); W wraps to col 1 (rank 1)."""
nbrs = torus_2d(rank=0, world_size=6, w=2, h=3)
assert nbrs == {"N": 4, "S": 2, "W": 1, "E": 1}, nbrs
def test_torus_2d_2x3_bottom_right():
"""rank 5: S wraps to row 0 (rank 1); E wraps to col 0 (rank 4)."""
nbrs = torus_2d(rank=5, world_size=6, w=2, h=3)
assert nbrs == {"N": 3, "S": 1, "W": 4, "E": 4}, nbrs
# ── mesh_2d alias for torus_2d ───────────────────────────────────────
def test_mesh_2d_2x3_matches_torus_2d():
"""mesh_2d is currently a torus alias; behaviour must match torus_2d."""
for rank in range(6):
assert mesh_2d(rank=rank, world_size=6, w=2, h=3) == \
torus_2d(rank=rank, world_size=6, w=2, h=3)
# ── Back-compat: square layouts still work without w/h kwargs ────────
def test_square_back_compat_mesh_2d_no_wrap():
"""Calling without w, h should still work for square world_size."""
nbrs = mesh_2d_no_wrap(rank=0, world_size=4)
assert nbrs == {"S": 2, "E": 1}, nbrs
def test_square_back_compat_torus_2d():
nbrs = torus_2d(rank=0, world_size=4)
assert nbrs == {"N": 2, "S": 2, "W": 1, "E": 1}, nbrs
# ── Validation: w*h must match world_size ────────────────────────────
def test_rectangular_dims_must_match_world_size():
"""Phase 2 contract: explicit w, h must satisfy w*h == world_size."""
with pytest.raises(ValueError):
mesh_2d_no_wrap(rank=0, world_size=6, w=3, h=3) # 9 != 6
+1 -1
View File
@@ -44,7 +44,7 @@ _CFG = AddressConfig(
def _make_allocators(num_pe: int = 8) -> dict[tuple[int, int, int], PEMemAllocator]:
return {
(0, 0, i): PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=i, cfg=_CFG)
(0, 0, i): PEMemAllocator(sip_id=0, die_id=0, pe_id=i, cfg=_CFG)
for i in range(num_pe)
}
+2 -2
View File
@@ -55,7 +55,7 @@ def _make_ctx():
def test_allocator_free_hbm_reclaims_space():
"""free_hbm returns HBM space; subsequent alloc can reuse it."""
a = PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=0, cfg=_CFG)
a = PEMemAllocator(sip_id=0, die_id=0, pe_id=0, cfg=_CFG)
pa1 = a.alloc_hbm(4096)
used_after_alloc = a.hbm_used
a.free_hbm(pa1, 4096)
@@ -66,7 +66,7 @@ def test_allocator_free_hbm_reclaims_space():
def test_allocator_free_tcm_reclaims_space():
"""free_tcm returns TCM space."""
a = PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=0, cfg=_CFG)
a = PEMemAllocator(sip_id=0, die_id=0, pe_id=0, cfg=_CFG)
pa1 = a.alloc_tcm(256)
used_after_alloc = a.tcm_used
a.free_tcm(pa1, 256)
+1 -1
View File
@@ -39,7 +39,7 @@ _CFG = AddressConfig(
def _make_allocators(num_pe: int = 8) -> dict[tuple[int, int, int], PEMemAllocator]:
return {
(0, 0, i): PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=i, cfg=_CFG)
(0, 0, i): PEMemAllocator(sip_id=0, die_id=0, pe_id=i, cfg=_CFG)
for i in range(num_pe)
}
+1 -1
View File
@@ -70,7 +70,7 @@ def _make_standalone(shape, num_pe=NUM_PE):
sram_bytes_per_cube=32 * _MB,
)
allocators = {
(0, 0, i): PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=i, cfg=cfg)
(0, 0, i): PEMemAllocator(sip_id=0, die_id=0, pe_id=i, cfg=cfg)
for i in range(num_pe)
}
va_alloc = VirtualAllocator(va_base=0x1_0000_0000, va_size=64 * _GB, page_size=4096)