Compare commits
6 Commits
| Author | SHA1 | Date |
|---|---|---|
| | e9cc40f74d | |
| | c1a5cf3a2a | |
| | 90874abbfe | |
| | 19dfc86dc3 | |
| | 14d800b0ae | |
| | 6918e6e906 | |
@@ -67,6 +67,76 @@ Completion semantics:
 
 ---
 
+### D5. Launch timing is endpoint-synchronized
+
+All PEs targeted by a single kernel launch MUST begin executing the kernel
+body at the same simulated time, regardless of their dispatch path length
+from the launch entry point.
+
+Rationale. The dispatch tree Host → IO_CPU → M_CPU → PE_CPU has variable
+latency at every level. PEs near their M_CPU receive the launch earlier
+than PEs farther away; cubes near an IO_CPU receive it earlier than cubes
+farther away. Without synchronization, each PE's kernel begins at a
+different `env.now`, making per-PE metrics such as `pe_exec_ns` a function
+of dispatch-path geometry rather than of the kernel's behavior,
+producing measurement artifacts in benchmarks that time kernel-internal
+waits (for example `tl.recv` on cross-cube or cross-SIP hops).
+
+Mechanism.
+
+- `KernelLaunchMsg` carries an optional `target_start_ns: float | None`.
+- **IO_CPU** is the canonical stamper. On fan-out to M_CPUs, it
+  computes `target_start_ns = env.now + max_latency` where
+  `max_latency` is the maximum, over every target (sip, cube, pe)
+  tuple, of the **two-leg dispatch chain**:
+
+  ```
+  max_latency(sip, cube, pe) =
+      compute_path_latency_ns(find_node_path(io_cpu, m_cpu(sip, cube)))
+    + compute_path_latency_ns(find_node_path(m_cpu(sip, cube), pe_cpu))
+    - io_cpu.overhead_ns
+    - m_cpu.overhead_ns
+  ```
+
+  This models the actual dispatch as **two sequential Transactions**
+  (IO_CPU → M_CPU, then M_CPU → PE_CPU). Each leg's
+  `compute_path_latency_ns` adds its endpoints' `overhead_ns`;
+  `io_cpu.overhead_ns` is subtracted because IO_CPU has already
+  paid it before this method runs, and `m_cpu.overhead_ns` is
+  subtracted once because it appears as endpoint of leg1 *and*
+  start of leg2 but is paid only once at run time. A single
+  `find_node_path(io_cpu, pe_cpu)` walk is **not** equivalent:
+  it can pick a graph path that bypasses M_CPU and silently
+  under-shoots the prediction for far cubes, breaking the D5
+  invariant.
+
+  The fanned-out sub-Transactions carry **`nbytes = 0`** for
+  `KernelLaunchMsg` (control message only). Without this,
+  large kernel-launch payloads would occupy fabric BW on the
+  shared first hop and serialize the per-cube dispatch, pushing
+  far M_CPUs past `target_start_ns` and re-introducing the
+  late-arrival violation.
+- **M_CPU** passes an already-stamped `target_start_ns` through
+  unchanged. Only when the value is absent (e.g. a direct
+  launch-to-M_CPU unit test) does M_CPU compute a per-cube barrier
+  `env.now + max(local command-path latency)`.
+- **PE_CPU** yields `env.timeout(target_start_ns - env.now)` at the top
+  of `_execute_kernel`, before recording `pe_exec_start` and invoking
+  the kernel body.
+- When `target_start_ns is None`, PE_CPU falls through to the legacy
+  unsynchronized behavior, preserving backward compatibility.
+
+IO_CPU-level stamping guarantees every PE across every targeted cube
+uses the same barrier sim-time, eliminating both the within-cube
+dispatch-offset artifact *and* the cross-cube offset artifact in
+multi-cube launches. Models a real-hardware timed-broadcast launch
+(latency-equalized dispatch tree).
+
+The synchronization is internal to the engine / IO_CPU / M_CPU / PE_CPU
+control plane; runtime API and application kernels are unchanged.
+
+---
+
 ## Links
 
 - SPEC R1, R2, R7, R8
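The two-leg barrier prediction described in D5 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the simulator's code: `leg_latency_ns`, `two_leg_latency_ns`, the node names, and every number below are hypothetical, and each leg is assumed to charge the overhead of every node on it (both endpoints included) plus its edge latencies.

```python
from typing import Dict, List, Tuple

# Hypothetical per-node overheads and per-edge latencies (ns), illustration only.
OVERHEAD_NS: Dict[str, float] = {
    "io_cpu": 5.0, "m_cpu": 3.0, "pe_cpu0": 1.0, "pe_cpu1": 1.0,
}
EDGE_NS: Dict[Tuple[str, str], float] = {
    ("io_cpu", "m_cpu"): 10.0,
    ("m_cpu", "pe_cpu0"): 2.0,
    ("m_cpu", "pe_cpu1"): 6.0,
}


def leg_latency_ns(path: List[str]) -> float:
    """Latency of one leg: overhead of every node on the path plus every edge."""
    nodes = sum(OVERHEAD_NS[n] for n in path)
    edges = sum(EDGE_NS[(a, b)] for a, b in zip(path, path[1:]))
    return nodes + edges


def two_leg_latency_ns(io_cpu: str, m_cpu: str, pe_cpu: str) -> float:
    """IO_CPU -> M_CPU plus M_CPU -> PE_CPU, with double-counted overheads removed."""
    leg1 = leg_latency_ns([io_cpu, m_cpu])
    leg2 = leg_latency_ns([m_cpu, pe_cpu])
    # io_cpu overhead is already paid; m_cpu is counted in both legs but paid once.
    return leg1 + leg2 - OVERHEAD_NS[io_cpu] - OVERHEAD_NS[m_cpu]


def target_start_ns(now_ns: float, pe_cpus: List[str]) -> float:
    """Single barrier time shared by every targeted PE."""
    return now_ns + max(two_leg_latency_ns("io_cpu", "m_cpu", pe) for pe in pe_cpus)
```

With these made-up numbers the far PE (`pe_cpu1`) sets the barrier, so the near PE simply waits out the difference before starting its kernel body.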
@@ -372,24 +372,41 @@ When the receiver frees a slot, the sender must learn about it
 travel through general vc_comm fabric; it uses a **separate fast
 path**, an abstraction of the NVLink / UCIe credit-return wire.
 
-**Latency** is computed from the **bottleneck BW on the path**, not a
-magic constant:
+**Latency** is computed from the **full path latency** (per-node
+overhead + edge propagation + drain), not a magic constant:
 
 ```
 credit_size_bytes = 16   (ccl.yaml: ipcq_credit_size_bytes)
-path = router.find_path(self_pe, peer_pe)
-latency = compute_drain_ns(path, credit_size_bytes)
-        = credit_size_bytes / bottleneck_bw_on_path
+path = router.find_path(self_pe, peer_pe.pe_dma)
+latency = compute_path_latency_ns(path, credit_size_bytes)
+        = sum(edge.distance_mm * ns_per_mm)
+        + sum(node_overhead_ns[n] for n in path)
+        + credit_size_bytes / bottleneck_bw_on_path
 ```
 
+The router auto-appends `.pe_dma` to the source only, so the
+destination MUST be spelled with the explicit `.pe_dma` suffix or
+`find_path` raises and the credit silently teleports at zero cost
+(latent bug fixed alongside this update).
+
+`tl.recv` blocks on the credit-emit completion (recv yields-from
+`_delayed_credit_send` rather than spawning it as a fork). This puts
+the credit-return cost on the receiver's `pe_exec_ns`, modeling the
+IPCQ control-plane completing the consume-acknowledgement before
+recv returns to the kernel, the protocol equivalent of a non-posted
+`tl.store` waiting for an HBM ack on the raw DMA path.
+
 That gives us:
 
 - **Topology-proportional approximation**: an in-cube credit return is
   automatically faster than a cross-SIP credit return.
-- **No magic constants**: no arbitrary `ipcq_ctrl_latency_ns`.
+- **No magic constants**: every nanosecond comes from
+  `compute_path_latency_ns` on the same edge_map and `node_overhead_ns`
+  as data traffic.
 - **No deadlock risk**: unlike piggyback, B can issue credit even when
-  it has no data to send back.
-- **Reuses existing utility**: `ComponentContext.compute_drain_ns`.
+  it has no data to send back. `peer_credit_store.put` is unbounded.
+- **`IPCQ ≥ raw DMA`** for matched physical moves: the credit-emit
+  cost on recv balances the HBM ack-trip cost RAW pays on the sender.
 
 #### Component coupling: SimPy Store channel
 
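The three-term credit latency formula in this hunk can be sketched as below. This is an illustrative stand-in, not `ComponentContext.compute_path_latency_ns` itself: the `Edge` type, the `path_latency_ns` helper, the bandwidth unit (bytes per ns), and all numeric values are assumptions chosen only to show how an in-cube return ends up cheaper than a cross-SIP one.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge record: physical length and bandwidth (bytes per ns)."""
    distance_mm: float
    bw_bytes_per_ns: float


def path_latency_ns(
    edges: List[Edge],
    node_overheads_ns: List[float],
    nbytes: int,
    ns_per_mm: float,
) -> float:
    """Propagation + per-node overhead + drain at the bottleneck bandwidth."""
    propagation = sum(e.distance_mm * ns_per_mm for e in edges)
    overhead = sum(node_overheads_ns)
    drain = nbytes / min(e.bw_bytes_per_ns for e in edges)
    return propagation + overhead + drain


CREDIT_SIZE_BYTES = 16  # matches ipcq_credit_size_bytes in the text

# One short hop inside a cube versus three hops across a slower inter-SIP link.
in_cube = path_latency_ns([Edge(2.0, 64.0)], [1.0, 1.0], CREDIT_SIZE_BYTES, 0.5)
cross_sip = path_latency_ns(
    [Edge(2.0, 64.0), Edge(40.0, 16.0), Edge(2.0, 64.0)],
    [1.0, 2.0, 2.0, 1.0],
    CREDIT_SIZE_BYTES,
    0.5,
)
print(in_cube, cross_sip)
```

Because every term is derived from the same edge and node data as ordinary traffic, the ordering between the two cases follows from topology alone.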
@@ -420,11 +437,21 @@ fan-out (see `IpcqInitMsg` in D12).
 #### PE_DMA's added responsibility
 
 When `vc_comm` receives a token, PE_DMA processes it as the following
-**atomic** sequence. **No SimPy yield is allowed between the two steps**
-(invariant I6):
+sequence: pay the Transaction's terminal BW drain, then atomically
+write data and forward metadata. **No SimPy yield is allowed between
+the data write and the metadata forward** (invariant I6). The drain
+yield must sit before the atomic block, not inside it:
 
 ```python
-def _on_vc_comm_recv(self, env, token):
+def _on_vc_comm_recv(self, env, txn):
+    # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
+    # sender PE_DMA). MUST happen before the atomic block so recv only
+    # wakes after the bytes have "landed".
+    drain = getattr(txn, "drain_ns", 0.0)
+    if drain > 0:
+        yield env.timeout(drain)
+
+    token = txn.request
     # ── ATOMIC: no yield between these two operations ──
     data = self._memory_store.read(token.src_space, token.src_addr,
                                    shape=..., dtype=...)
@@ -439,6 +466,33 @@ The final `put` is yieldable but uses an unbounded internal store, so
 it completes in a single step. That `put` is the closing call of the
 atomic block; nothing may be inserted before it.
 
+#### Drain-at-inbound semantics (D9 timing model)
+
+The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path`
+stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns`
+is paid at each forwarding component via `run()`, and the remaining
+BW drain is paid once at the Transaction's terminal. Every non-IPCQ
+Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via
+`ComponentBase._forward_txn` at the terminal node. For IPCQ the
+destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound`
+(so IPCQ-specific data write + metadata forward can happen), so **the
+drain MUST be paid explicitly at the top of that handler** to keep
+IPCQ's timing model on par with every other fabric Transaction.
+
+Side-effects of paying drain here:
+
+- **SRC `tl.send`** is unchanged: fire-and-forget semantics are
+  preserved because the sender PE_DMA does not `yield sub_done`. The
+  `sub_done.succeed()` call (made after metadata forward below) is an
+  event with no listener on the sender side.
+- **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only
+  when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata
+  forward now happens after the drain, recv observes the full fabric
+  transfer time including bandwidth cost.
+
+Matches the physical picture: send dispatches and leaves; recv waits
+until the bytes have actually been drained into its inbox.
+
 ### D9.5. ADR-0020 (2-pass) integration
 
 `tl.send` / `tl.recv` integrates with ADR-0020's two-pass model. Phase
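The ordering rule (drain first, then an atomic data write plus metadata forward) can be illustrated without SimPy. The sketch below uses a toy driver in which a generator yields delays and the driver advances a clock; `on_inbound`, `run`, and the single-element `clock` list are hypothetical names invented for this illustration, not simulator APIs.

```python
from typing import Generator, List, Tuple


def on_inbound(
    drain_ns: float,
    log: List[Tuple[str, float]],
    clock: List[float],
) -> Generator[float, None, None]:
    """Toy inbound handler: each yielded value is a delay in ns."""
    if drain_ns > 0:
        yield drain_ns  # pay the drain BEFORE the atomic block
    # Atomic block: no yield between the two steps below.
    log.append(("data_write", clock[0]))
    log.append(("meta_forward", clock[0]))


def run(gen: Generator[float, None, None], clock: List[float]) -> None:
    """Toy driver: every yielded delay advances the simulated clock."""
    for delay in gen:
        clock[0] += delay


clock = [100.0]  # simulated "now" when the transaction arrives
log: List[Tuple[str, float]] = []
run(on_inbound(25.0, log, clock), clock)
print(log)
```

Both log entries carry the same timestamp, which is the arrival time plus the drain; a yield placed between the two appends would let the timestamps diverge, which is what invariant I6 forbids.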
@@ -365,23 +365,39 @@ data 경로의 piggyback 모델과 달리, credit return은 일반 vc_comm fabri
 does not traverse it and is instead handled on a **separate fast path**. This abstracts the NVLink/UCIe
 credit-return fast path of real hardware.
 
-**Latency computation**: derived from the **bottleneck BW of the routed path**,
-not from a magic constant.
+**Latency computation**: derived from the **full path latency of the routed
+path** (per-node overhead + edge propagation + drain), not from a magic
+constant.
 
 ```
 credit_size_bytes = 16   (ccl.yaml: ipcq_credit_size_bytes)
-path = router.find_path(self_pe, peer_pe)
-latency = compute_drain_ns(path, credit_size_bytes)
-        = credit_size_bytes / bottleneck_bw_on_path
+path = router.find_path(self_pe, peer_pe.pe_dma)
+latency = compute_path_latency_ns(path, credit_size_bytes)
+        = sum(edge.distance_mm * ns_per_mm)
+        + sum(node_overhead_ns[n] for n in path)
+        + credit_size_bytes / bottleneck_bw_on_path
 ```
 
+The router auto-appends `.pe_dma` to the source only, so the destination
+must carry an explicit `.pe_dma` suffix. Otherwise `find_path` raises and
+the credit silently teleports at zero cost, a latent bug (fixed in this
+update).
+
+`tl.recv` waits for credit-emit completion via yield-from (previously it was
+forked with `env.process`). This charges the credit-return cost to the
+receiver's `pe_exec_ns`: recv returns to the kernel only after the IPCQ
+control plane has completed the consume-acknowledgement, the protocol-level
+equivalent of RAW DMA's non-posted `tl.store` waiting for the HBM ack trip.
+
 This gives:
 - **Topology-proportional approximation**: an in-cube credit return and a cross-SIP credit
-  automatically get different latencies (not exact values, but more meaningful than a magic constant)
-- **No magic constants**: no arbitrary value such as a separate `ipcq_ctrl_latency_ns` is needed
-- **No deadlock risk**: unlike piggyback, credit is issued automatically even when
-  B has no data to send to A
-- **Reuses existing utility**: uses `ComponentContext.compute_drain_ns` as is
+  automatically get different latencies
+- **No magic constants**: every ns value comes from `compute_path_latency_ns`,
+  computed from the same edge_map and `node_overhead_ns` as data traffic
+- **No deadlock risk**: `peer_credit_store.put` is unbounded, and credit is issued
+  automatically even when B has no data to send to A
+- **`IPCQ ≥ raw DMA`** guaranteed: for a matched physical move, credit-emit
+  balances RAW's ack-trip cost
 
 ```
 PE B: tl.recv(W) → takes the data → my_tail++
@@ -426,11 +442,22 @@ backend init에서 IpcqInitMsg fan-out 시 양방향 fast path channel을 함께
 
 #### PE_DMA's added responsibility
 
-On receiving a token, PE_DMA (vc_comm) processes it as the following atomic sequence.
-**No SimPy yield may be placed between the two operations** (see the I6 MUST rule):
+On receiving a token, PE_DMA (vc_comm) processes it as the following sequence:
+first pay the Transaction's terminal BW drain, then atomically perform the data
+write and metadata forward. **No SimPy yield may be placed between the data
+write and the metadata forward** (see the I6 MUST rule). The drain yield must
+sit before the atomic section, not inside it:
 
 ```python
-def _on_vc_comm_recv(self, env, token):
+def _on_vc_comm_recv(self, env, txn):
+    # Pay the drain_ns (= nbytes / bottleneck_bw) stamped by the sender
+    # PE_DMA here. It must precede the atomic section, because recv must
+    # wake only after the bytes have "arrived".
+    drain = getattr(txn, "drain_ns", 0.0)
+    if drain > 0:
+        yield env.timeout(drain)
+
+    token = txn.request
     # ── ATOMIC: no yield between the two operations ──
     # 1. write data to dst_addr (the dst memory space is token.dst_endpoint.buffer_kind)
     data = self._memory_store.read(token.src_space, token.src_addr,
@@ -446,6 +473,32 @@ wire로 capacity가 unbounded인 store를 사용하므로 즉시 완료된다 (
 single-step). This final put is the end of the atomic section, and no other
 yield may be inserted before it.
 
+#### Drain-at-inbound semantics (D9 timing model)
+
+The Transaction enters the fabric with `drain_ns = nbytes / bottleneck_bw_on_path`
+stamped by the sender PE_DMA. In this simulator, per-hop `overhead_ns` is paid
+in each forwarding component's `run()`, and the remaining BW drain is paid once
+at the Transaction's terminal node. For every non-IPCQ Transaction (raw DMA,
+kernel-launch fanout, etc.), `ComponentBase._forward_txn` pays this drain at
+the terminal. For IPCQ, the destination PE_DMA intercepts the Transaction with
+the `_handle_ipcq_inbound` handler (because it must do the IPCQ-specific data
+write + metadata forward), so **the drain must be paid explicitly at the top of
+this handler**; only then is IPCQ's timing model on par with every other
+fabric Transaction.
+
+Side-effects of paying the drain here:
+
+- **SRC `tl.send`**: behavior unchanged. The sender PE_DMA does not `yield` on
+  `sub_done`, so fire-and-forget semantics are preserved. The `sub_done.succeed()`
+  called after the metadata forward is an event with no listener on the sender side.
+- **DST `tl.recv`**: wakes `drain_ns` later. recv is woken only when its local
+  PE_IPCQ receives `IpcqMetaArrival`, and the metadata forward has moved to
+  after the drain, so recv observes the full fabric transfer time including
+  bandwidth.
+
+Consistent with the physical picture: send dispatches and returns immediately;
+recv waits until the bytes have actually drained into its inbox.
+
 #### Backpressure latency accuracy
 
 Time until backpressure is released:
@@ -1,22 +1,24 @@
-"""SFR configuration for intercube + inter-SIP IPCQ wiring.
+"""SFR configuration for the full IPCQ hardware wiring.
 
-Provides ``configure_sfr_intercube_multisip`` which programs PE_IPCQ
-neighbor tables for:
+Installs PE_IPCQ neighbor tables modeling the physical hardware.
+Wiring is independent of DPPolicy / kernel choice; the kernel decides
+at runtime which links to use.
 
-1. Intercube within each SIP: pe0 of every cube connects to pe0 of
-   its N/S/E/W mesh neighbors (no wrap-around).
-2. Inter-SIP on ALL cubes: pe0 of cube_c on sip_A connects to pe0 of
-   cube_c on each peer SIP, using ``global_E``/``global_W`` (ring) or
-   ``global_N``/``global_S``/``global_E``/``global_W`` (mesh/torus)
-   direction labels. Wiring all cubes allows the kernel to
-   dynamically elect the root cube at runtime.
+Direction label namespaces (disjoint):
 
-SIP-level topology is read from ``topology.yaml`` →
-``system.sips.topology`` (e.g. ``ring_1d``, ``mesh_2d``).
-Intercube mesh dimensions come from ``sip.cube_mesh.w/h``.
+- Intra-cube PE-to-PE: ``intra_N / intra_S / intra_E / intra_W``
+  Logical 2×4 PE grid within a cube (no wrap):
 
-Internally delegates to ``install_ipcq`` with a computed ``rank_to_pe``
-(pe0-only) and a closure-captured ``neighbors()`` function.
+      Row 0: pe0 pe1 pe2 pe3
+      Row 1: pe4 pe5 pe6 pe7
+
+- Intercube same-lane: ``N / S / E / W``
+  ``pe_i of cube_A ↔ pe_i of cube_B`` across the 4×4 cube mesh
+  (no wrap). Every PE i ∈ [0..7] wired independently.
+
+- Inter-SIP same-(cube, pe): ``global_N / global_S / global_E / global_W``
+  ``pe_i of cube_c on sip_A ↔ pe_i of cube_c on sip_B`` per
+  ``topology.yaml → system.sips.topology``.
 """
 from __future__ import annotations
 
@@ -27,12 +29,46 @@ from kernbench.ccl.install import install_ipcq
 from kernbench.ccl.topologies import _BUILTIN as _TOPO_BUILTINS
 
 
+# ── Intra-cube 2×4 PE grid ───────────────────────────────────────────
+
+_PE_GRID_COLS = 4
+_PE_GRID_ROWS = 2
+_PES_PER_CUBE = _PE_GRID_COLS * _PE_GRID_ROWS  # 8
+
+
+def _intra_cube_neighbors(pe: int) -> dict[str, int]:
+    """Logical 2×4 PE grid neighbors within a cube (no wrap).
+
+    Returns directions in the ``intra_*`` namespace.
+    """
+    row, col = divmod(pe, _PE_GRID_COLS)
+    nbrs: dict[str, int] = {}
+    if col < _PE_GRID_COLS - 1:
+        nbrs["intra_E"] = row * _PE_GRID_COLS + (col + 1)
+    if col > 0:
+        nbrs["intra_W"] = row * _PE_GRID_COLS + (col - 1)
+    if row < _PE_GRID_ROWS - 1:
+        nbrs["intra_S"] = (row + 1) * _PE_GRID_COLS + col
+    if row > 0:
+        nbrs["intra_N"] = (row - 1) * _PE_GRID_COLS + col
+    return nbrs
+
+
+# ── Public entry point ───────────────────────────────────────────────
+
+
 def configure_sfr_intercube_multisip(
     engine: Any,
     spec: dict,
     cfg: dict,
 ) -> dict[str, Any]:
-    """Wire IPCQ for intercube (pe0, mesh) + inter-SIP (pe0, all cubes).
+    """Wire the full IPCQ hardware model.
 
+    Every PE on every cube on every SIP gets neighbor table entries for:
+
+    - intra-cube (2×4 grid) in the ``intra_*`` namespace
+    - intercube same-lane (4×4 cube mesh, no wrap) in ``N/S/E/W``
+    - inter-SIP same-(cube, pe) in ``global_*``
+
     Args:
         engine: GraphEngine with ``_components``.
@@ -46,48 +82,71 @@ def configure_sfr_intercube_multisip(
     mesh_w = int(cm["w"])
     mesh_h = int(cm["h"])
     n_cubes = mesh_w * mesh_h
-    n_sips = int(spec.get("system", {}).get("sips", {}).get("count", 1))
-    sip_topology = str(
-        spec.get("system", {}).get("sips", {}).get("topology", "ring_1d")
-    )
+    sips_cfg = spec.get("system", {}).get("sips", {})
+    n_sips = int(sips_cfg.get("count", 1))
+    sip_topology = str(sips_cfg.get("topology", "ring_1d"))
+    sip_w = sips_cfg.get("w")
+    sip_h = sips_cfg.get("h")
+    sip_w = int(sip_w) if sip_w is not None else None
+    sip_h = int(sip_h) if sip_h is not None else None
 
     if sip_topology not in _TOPO_BUILTINS:
         raise ValueError(
             f"Unknown sip topology '{sip_topology}'. "
             f"Available: {list(_TOPO_BUILTINS)}"
         )
-    sip_topo_fn = _TOPO_BUILTINS[sip_topology]
+    _sip_topo_fn_raw = _TOPO_BUILTINS[sip_topology]
 
-    world_size = n_sips * n_cubes
+    def sip_topo_fn(rank: int, ws: int) -> dict:
+        if sip_w is not None and sip_h is not None:
+            try:
+                return _sip_topo_fn_raw(rank, ws, w=sip_w, h=sip_h)
+            except TypeError:
+                pass
+        return _sip_topo_fn_raw(rank, ws)
+
+    pes_per_cube = _PES_PER_CUBE
+    world_size = n_sips * n_cubes * pes_per_cube
     pe_idx_to_pe: list[tuple[int, int, int]] = [
-        (sip, cube, 0)
+        (sip, cube, pe)
         for sip in range(n_sips)
         for cube in range(n_cubes)
+        for pe in range(pes_per_cube)
     ]
 
+    def _pe_idx(sip: int, cube: int, pe: int) -> int:
+        return (sip * n_cubes + cube) * pes_per_cube + pe
+
     def _neighbors(pe_idx: int, ws: int, _base: dict) -> dict[str, int]:
-        sip = pe_idx // n_cubes
-        cube = pe_idx % n_cubes
+        tmp = pe_idx
+        pe = tmp % pes_per_cube
+        tmp //= pes_per_cube
+        cube = tmp % n_cubes
+        sip = tmp // n_cubes
         row = cube // mesh_w
         col = cube % mesh_w
 
         nbrs: dict[str, int] = {}
 
-        # Intercube within SIP (mesh, no wrap-around)
-        if col < mesh_w - 1:
-            nbrs["E"] = sip * n_cubes + (row * mesh_w + col + 1)
-        if col > 0:
-            nbrs["W"] = sip * n_cubes + (row * mesh_w + col - 1)
-        if row < mesh_h - 1:
-            nbrs["S"] = sip * n_cubes + ((row + 1) * mesh_w + col)
-        if row > 0:
-            nbrs["N"] = sip * n_cubes + ((row - 1) * mesh_w + col)
+        # ── Intra-cube (intra_N/S/E/W) ──
+        for d, peer_pe in _intra_cube_neighbors(pe).items():
+            nbrs[d] = _pe_idx(sip, cube, peer_pe)
 
-        # Inter-SIP on ALL cubes
+        # ── Intercube same-lane (N/S/E/W, 4×4 no wrap) ──
+        if col < mesh_w - 1:
+            nbrs["E"] = _pe_idx(sip, row * mesh_w + (col + 1), pe)
+        if col > 0:
+            nbrs["W"] = _pe_idx(sip, row * mesh_w + (col - 1), pe)
+        if row < mesh_h - 1:
+            nbrs["S"] = _pe_idx(sip, (row + 1) * mesh_w + col, pe)
+        if row > 0:
+            nbrs["N"] = _pe_idx(sip, (row - 1) * mesh_w + col, pe)
+
+        # ── Inter-SIP same-(cube, pe) (global_*) ──
         if n_sips > 1:
             sip_nbrs = sip_topo_fn(sip, n_sips)
             for d, peer_sip in sip_nbrs.items():
-                nbrs[f"global_{d}"] = peer_sip * n_cubes + cube
+                nbrs[f"global_{d}"] = _pe_idx(peer_sip, cube, pe)
 
         return nbrs
 
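The flat-index packing used by `_pe_idx` and its inverse inside `_neighbors` can be checked in isolation. The sketch below restates both as free functions; `pack_pe_idx` and `unpack_pe_idx` are hypothetical names, and the 16-cube, 8-PE sizes are only the values implied by the 4×4 mesh and 2×4 grid mentioned in the docstring.

```python
from typing import Tuple


def pack_pe_idx(sip: int, cube: int, pe: int, n_cubes: int, pes_per_cube: int) -> int:
    """Flatten (sip, cube, pe) with pe as the fastest-varying field."""
    return (sip * n_cubes + cube) * pes_per_cube + pe


def unpack_pe_idx(idx: int, n_cubes: int, pes_per_cube: int) -> Tuple[int, int, int]:
    """Inverse of pack_pe_idx: peel off pe, then cube, then sip."""
    pe = idx % pes_per_cube
    tmp = idx // pes_per_cube
    cube = tmp % n_cubes
    sip = tmp // n_cubes
    return sip, cube, pe


N_CUBES, PES_PER_CUBE = 16, 8  # assumed sizes for the example

idx = pack_pe_idx(1, 5, 3, N_CUBES, PES_PER_CUBE)
print(idx, unpack_pe_idx(idx, N_CUBES, PES_PER_CUBE))
```

The decode order must mirror the encode order; the pre-change code, which divided by `n_cubes` directly, would misread any index once a PE field is packed in.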
@@ -33,23 +33,41 @@ def ring_1d_unidir(rank: int, world_size: int) -> NeighborMap:
     return {"E": (rank + 1) % world_size}
 
 
-def mesh_2d(rank: int, world_size: int) -> NeighborMap:
-    """Square 2D mesh (N/S/E/W).
-
-    Layout: rank = row * side + col, with side = sqrt(world_size).
-    Wrap-around (torus) on all four edges.
-    """
+def _resolve_2d_dims(
+    world_size: int, w: int | None, h: int | None, name: str,
+) -> tuple[int, int]:
+    if w is not None and h is not None:
+        if w * h != world_size:
+            raise ValueError(
+                f"{name}: w*h ({w}*{h}) != world_size ({world_size})"
+            )
+        return w, h
     side = int(round(world_size ** 0.5))
     if side * side != world_size:
         raise ValueError(
-            f"mesh_2d requires square world_size, got {world_size}"
+            f"{name} requires square world_size or explicit w,h, "
+            f"got {world_size}"
         )
-    r, c = divmod(rank, side)
+    return side, side
+
+
+def mesh_2d(
+    rank: int, world_size: int,
+    w: int | None = None, h: int | None = None,
+) -> NeighborMap:
+    """2D mesh (N/S/E/W) with wrap-around on all four edges.
+
+    Layout: rank = row * w + col. When w, h are given, supports
+    rectangular (e.g. 2x3) layouts. Otherwise falls back to square
+    side = sqrt(world_size).
+    """
+    w, h = _resolve_2d_dims(world_size, w, h, "mesh_2d")
+    r, c = divmod(rank, w)
     return {
-        "N": ((r - 1) % side) * side + c,
-        "S": ((r + 1) % side) * side + c,
-        "W": r * side + (c - 1) % side,
-        "E": r * side + (c + 1) % side,
+        "N": ((r - 1) % h) * w + c,
+        "S": ((r + 1) % h) * w + c,
+        "W": r * w + (c - 1) % w,
+        "E": r * w + (c + 1) % w,
     }
 
 
@@ -73,36 +91,30 @@ def tree_binary(rank: int, world_size: int) -> NeighborMap:
     return n
 
 
-def torus_2d(rank: int, world_size: int) -> NeighborMap:
-    """Square 2D torus (N/S/E/W) with wrap-around on all edges.
-
-    Alias for mesh_2d (which already wraps). Explicit name for clarity
-    when used as a SIP-level topology.
-    """
-    return mesh_2d(rank, world_size)
+def torus_2d(
+    rank: int, world_size: int,
+    w: int | None = None, h: int | None = None,
+) -> NeighborMap:
+    """2D torus (N/S/E/W) with wrap-around on all edges. Alias for mesh_2d."""
+    return mesh_2d(rank, world_size, w=w, h=h)
 
 
-def mesh_2d_no_wrap(rank: int, world_size: int) -> NeighborMap:
-    """Square 2D mesh (N/S/E/W) WITHOUT wrap-around.
-
-    Edge nodes have fewer neighbors (no wrapping). Used for SIP-level
-    topologies where physical links don't wrap.
-    """
-    side = int(round(world_size ** 0.5))
-    if side * side != world_size:
-        raise ValueError(
-            f"mesh_2d_no_wrap requires square world_size, got {world_size}"
-        )
-    r, c = divmod(rank, side)
+def mesh_2d_no_wrap(
+    rank: int, world_size: int,
+    w: int | None = None, h: int | None = None,
+) -> NeighborMap:
+    """2D mesh (N/S/E/W) WITHOUT wrap-around. Supports rectangular dims."""
+    w, h = _resolve_2d_dims(world_size, w, h, "mesh_2d_no_wrap")
+    r, c = divmod(rank, w)
     n: NeighborMap = {}
     if r > 0:
-        n["N"] = (r - 1) * side + c
-    if r < side - 1:
-        n["S"] = (r + 1) * side + c
+        n["N"] = (r - 1) * w + c
+    if r < h - 1:
+        n["S"] = (r + 1) * w + c
     if c > 0:
-        n["W"] = r * side + (c - 1)
-    if c < side - 1:
-        n["E"] = r * side + (c + 1)
+        n["W"] = r * w + (c - 1)
+    if c < w - 1:
+        n["E"] = r * w + (c + 1)
     return n
 
 
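To see what the rectangular wrap-around arithmetic produces, the neighbor formula from the new `mesh_2d` can be exercised on a 3-wide, 2-high layout. The function below is a standalone restatement for illustration only; `mesh_2d_wrap` is a hypothetical name and takes `w` and `h` directly instead of going through `_resolve_2d_dims`.

```python
from typing import Dict


def mesh_2d_wrap(rank: int, w: int, h: int) -> Dict[str, int]:
    """Wrap-around N/S/E/W neighbors for rank = row * w + col."""
    r, c = divmod(rank, w)
    return {
        "N": ((r - 1) % h) * w + c,
        "S": ((r + 1) % h) * w + c,
        "W": r * w + (c - 1) % w,
        "E": r * w + (c + 1) % w,
    }


# Layout (w=3, h=2):
#   row 0: 0 1 2
#   row 1: 3 4 5
corner = mesh_2d_wrap(0, w=3, h=2)
opposite = mesh_2d_wrap(5, w=3, h=2)
print(corner, opposite)
```

With only two rows, the N and S neighbors of any rank coincide, so a consumer that treats them as distinct peers would address the same rank twice; that is a property of wrap-around on a height-2 grid, not of this sketch.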
@@ -58,7 +58,18 @@ class IoCpuComponent(ComponentBase):
         self._pending[key] = (expected, received, parent_done)
 
     def _dispatch_to_m_cpus(self, env: simpy.Environment, txn: Any) -> Generator:
-        """Fan out sub-Transactions to target cube M_CPUs, wait for responses."""
+        """Fan out sub-Transactions to target cube M_CPUs, wait for responses.
+
+        ADR-0009 D5 (extended): for KernelLaunchMsg, stamp a single global
+        target_start_ns = env.now + max(IO_CPU → any target PE_CPU path
+        latency across all target cubes). M_CPU passes this value through
+        unchanged; every PE in every cube yields until the same sim-time
+        before beginning kernel execution. Without this, cross-cube
+        launches would have each cube's M_CPU compute its own per-cube
+        barrier relative to its local env.now, leaving PEs on different
+        cubes out of sync (the "h3/h4 dispatch-offset artifact").
+        """
+        import dataclasses
         from kernbench.runtime_api.kernel import KernelLaunchMsg, MemoryReadMsg, MemoryWriteMsg
 
         request = txn.request
@@ -72,10 +83,60 @@ class IoCpuComponent(ComponentBase):
|
|||||||
txn.done.succeed()
|
txn.done.succeed()
|
||||||
return
|
return
|
||||||
|
|
||||||
|
# For KernelLaunchMsg, compute the global barrier once here so
|
||||||
|
# every downstream PE_CPU uses the same target_start_ns.
|
||||||
|
if isinstance(request, KernelLaunchMsg):
|
||||||
|
io_overhead = self.ctx.node_overhead_ns.get(self.node.id, 0.0)
|
||||||
|
global_max_latency = 0.0
|
||||||
|
pe_ids = self._resolve_pe_ids(
|
||||||
|
getattr(request, "target_pe", "all")
|
||||||
|
)
|
||||||
|
for sip, cube in cube_targets:
|
||||||
|
try:
|
||||||
|
m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
|
||||||
|
io_to_m_path = self.ctx.router.find_node_path(
|
||||||
|
self.node.id, m_cpu_id,
|
||||||
|
)
|
||||||
|
except Exception:
|
||||||
|
continue
|
||||||
|
if len(io_to_m_path) < 2:
|
||||||
|
continue
|
||||||
|
leg1 = self.ctx.compute_path_latency_ns(
|
||||||
|
io_to_m_path, nbytes=0,
|
||||||
|
)
|
||||||
|
m_overhead = self.ctx.node_overhead_ns.get(m_cpu_id, 0.0)
|
||||||
|
for pe_id in pe_ids:
|
||||||
|
pe_cpu_id = (
|
||||||
|
f"sip{sip}.cube{cube}.pe{pe_id}.pe_cpu"
|
||||||
|
)
|
||||||
|
try:
|
||||||
|
m_to_pe_path = self.ctx.router.find_node_path(
|
||||||
|
m_cpu_id, pe_cpu_id,
|
||||||
|
)
|
||||||
|
except Exception:
|
||||||
|
continue
|
||||||
|
if len(m_to_pe_path) < 2:
|
||||||
|
continue
|
||||||
|
leg2 = self.ctx.compute_path_latency_ns(
|
||||||
|
m_to_pe_path, nbytes=0,
|
||||||
|
)
|
||||||
|
latency = leg1 + leg2 - io_overhead - m_overhead
|
||||||
|
if latency > global_max_latency:
|
||||||
|
global_max_latency = latency
|
||||||
|
request = dataclasses.replace(
|
||||||
|
request,
|
||||||
|
target_start_ns=float(env.now) + global_max_latency,
|
||||||
|
)
|
||||||
|
|
||||||
# Setup aggregation
|
# Setup aggregation
|
||||||
self._pending[request.request_id] = (len(cube_targets), 0, txn.done)
|
self._pending[request.request_id] = (len(cube_targets), 0, txn.done)
|
||||||
|
|
||||||
# Fan out to each target cube's M_CPU
|
# Fan out to each target cube's M_CPU. Kernel-launch fanout
|
||||||
|
# carries control metadata only; nbytes is forced to 0 for
|
||||||
|
# KernelLaunchMsg so the launch sub-txns do not occupy data-fabric
|
||||||
|
# BW (would otherwise serialize 16 cubes' worth of fanout on the
|
||||||
|
# shared first hop and break ADR-0009 D5's barrier prediction).
|
||||||
|
is_kernel_launch = isinstance(request, KernelLaunchMsg)
|
||||||
for sip, cube in cube_targets:
|
for sip, cube in cube_targets:
|
||||||
try:
|
try:
|
||||||
m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
|
m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
|
||||||
@@ -86,11 +147,25 @@ class IoCpuComponent(ComponentBase):
|
|||||||
continue
|
continue
|
||||||
sub_txn = Transaction(
|
sub_txn = Transaction(
|
||||||
request=request, path=path, step=0,
|
request=request, path=path, step=0,
|
||||||
nbytes=txn.nbytes, done=env.event(),
|
nbytes=0 if is_kernel_launch else txn.nbytes,
|
||||||
|
done=env.event(),
|
||||||
result_data=txn.result_data,
|
result_data=txn.result_data,
|
||||||
)
|
)
|
||||||
yield self.out_ports[path[1]].put(sub_txn.advance())
|
yield self.out_ports[path[1]].put(sub_txn.advance())
|
||||||
|
|
||||||
|
def _resolve_pe_ids(self, target_pe: Any) -> list[int]:
|
||||||
|
"""Resolve target_pe → list of PE indices (mirrors M_CPU logic)."""
|
||||||
|
if isinstance(target_pe, int):
|
||||||
|
return [target_pe]
|
||||||
|
if isinstance(target_pe, tuple):
|
||||||
|
return list(target_pe)
|
||||||
|
# "all": all PEs in a cube
|
||||||
|
n_slices = 8
|
||||||
|
if self.ctx and self.ctx.spec:
|
||||||
|
mm = self.ctx.spec.get("cube", {}).get("memory_map", {})
|
||||||
|
n_slices = mm.get("hbm_slices_per_cube", 8)
|
||||||
|
return list(range(n_slices))
|
||||||
|
|
||||||
def _resolve_cube_targets(self, request: Any) -> list[tuple[int, int]]:
|
def _resolve_cube_targets(self, request: Any) -> list[tuple[int, int]]:
|
||||||
"""Return list of (sip, cube) pairs to fan out to."""
|
"""Return list of (sip, cube) pairs to fan out to."""
|
||||||
from kernbench.runtime_api.kernel import (
|
from kernbench.runtime_api.kernel import (
|
||||||
|
|||||||
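The two-leg barrier computed in the IO_CPU hunk above can be sketched in isolation. This is a minimal illustration, not the kernbench implementation: `stamp_target_start_ns` and its tuple layout are hypothetical, and the leg latencies are assumed to have been produced already by something like `compute_path_latency_ns`, which counts the overhead of every node on a path, endpoints included.

```python
from typing import Iterable


def stamp_target_start_ns(
    now_ns: float,
    io_overhead_ns: float,
    cubes: Iterable[tuple[float, float, list[float]]],
) -> float:
    """Return one global start time for every PE in the launch.

    Each cube entry is (leg1_ns, m_overhead_ns, [leg2_ns per PE]), where
    leg1 is IO_CPU -> M_CPU and leg2 is M_CPU -> PE_CPU (hypothetical layout).
    """
    max_latency = 0.0
    for leg1_ns, m_overhead_ns, leg2_per_pe in cubes:
        for leg2_ns in leg2_per_pe:
            # Same subtraction as the hunk: IO_CPU and M_CPU overhead are
            # each removed once from the two-leg sum.
            latency = leg1_ns + leg2_ns - io_overhead_ns - m_overhead_ns
            max_latency = max(max_latency, latency)
    return now_ns + max_latency


# Toy numbers (assumed): a near cube and a far cube, two PEs each.
near_cube = (30.0, 5.0, [20.0, 25.0])
far_cube = (80.0, 5.0, [20.0, 40.0])
target = stamp_target_start_ns(100.0, 10.0, [near_cube, far_cube])
```

The slowest chain (far cube, farthest PE) sets the barrier, so every other PE simply waits out the difference.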
@@ -162,7 +162,11 @@ class MCpuComponent(ComponentBase):
|
|||||||
Routes through find_node_path (M_CPU → NOC → PE_CPU command edges).
|
Routes through find_node_path (M_CPU → NOC → PE_CPU command edges).
|
||||||
PE_CPU sends ResponseMsg back via NOC → M_CPU on completion.
|
PE_CPU sends ResponseMsg back via NOC → M_CPU on completion.
|
||||||
Then sends aggregate ResponseMsg back to IO_CPU on the reverse path.
|
Then sends aggregate ResponseMsg back to IO_CPU on the reverse path.
|
||||||
|
|
||||||
|
ADR-0009 D5: passes through an IO_CPU-stamped target_start_ns, or
|
||||||
|
stamps a per-cube one if unset, so every PE starts at the same env.now.
|
||||||
"""
|
"""
|
||||||
|
import dataclasses
|
||||||
request = txn.request
|
request = txn.request
|
||||||
target_pe = getattr(request, "target_pe", "all")
|
target_pe = getattr(request, "target_pe", "all")
|
||||||
cube_prefix = self.node.id.rsplit(".", 1)[0] # e.g. "sip0.cube0"
|
cube_prefix = self.node.id.rsplit(".", 1)[0] # e.g. "sip0.cube0"
|
||||||
@@ -172,9 +176,13 @@ class MCpuComponent(ComponentBase):
|
|||||||
txn.done.succeed()
|
txn.done.succeed()
|
||||||
return
|
return
|
||||||
|
|
||||||
# Fan out to each PE_CPU, using response-based aggregation
|
# Resolve per-PE paths. If IO_CPU already stamped a global
|
||||||
sub_txns: list[Transaction] = []
|
# target_start_ns (ADR-0009 D5 extended), pass it through
|
||||||
n_dispatched = 0
|
# unchanged so every PE across every cube uses the same barrier.
|
||||||
|
# Otherwise (e.g. direct-to-M_CPU launch in a unit test) compute
|
||||||
|
# a per-cube barrier from env.now.
|
||||||
|
per_pe: list[tuple[int, list[str], float]] = []
|
||||||
|
max_latency = 0.0
|
||||||
for pe_id in pe_ids:
|
for pe_id in pe_ids:
|
||||||
pe_cpu_id = f"{cube_prefix}.pe{pe_id}.pe_cpu"
|
pe_cpu_id = f"{cube_prefix}.pe{pe_id}.pe_cpu"
|
||||||
try:
|
try:
|
||||||
@@ -183,8 +191,24 @@ class MCpuComponent(ComponentBase):
|
|||||||
continue
|
continue
|
||||||
if len(path) < 2:
|
if len(path) < 2:
|
||||||
continue
|
continue
|
||||||
|
latency = self.ctx.compute_path_latency_ns(path, nbytes=0)
|
||||||
|
per_pe.append((pe_id, path, latency))
|
||||||
|
if latency > max_latency:
|
||||||
|
max_latency = latency
|
||||||
|
|
||||||
|
if getattr(request, "target_start_ns", None) is not None:
|
||||||
|
stamped_request = request
|
||||||
|
else:
|
||||||
|
stamped_request = dataclasses.replace(
|
||||||
|
request, target_start_ns=float(env.now) + max_latency,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Fan out to each PE_CPU, using response-based aggregation
|
||||||
|
sub_txns: list[Transaction] = []
|
||||||
|
n_dispatched = 0
|
||||||
|
for pe_id, path, _lat in per_pe:
|
||||||
sub_txn = Transaction(
|
sub_txn = Transaction(
|
||||||
request=request, path=path, step=0,
|
request=stamped_request, path=path, step=0,
|
||||||
nbytes=0, done=env.event(),
|
nbytes=0, done=env.event(),
|
||||||
)
|
)
|
||||||
yield self.out_ports[path[1]].put(sub_txn.advance())
|
yield self.out_ports[path[1]].put(sub_txn.advance())
|
||||||
@@ -204,16 +228,21 @@ class MCpuComponent(ComponentBase):
|
|||||||
yield all_done
|
yield all_done
|
||||||
del self._parent_txns[request.request_id]
|
del self._parent_txns[request.request_id]
|
||||||
|
|
||||||
# Aggregate PE-internal metrics (max across PEs)
|
# Aggregate PE-internal metrics (max across PEs and across cubes).
|
||||||
|
# Multiple M_CPUs share the same result_data dict via IO_CPU fanout;
|
||||||
|
# merge against the existing value so cubes don't clobber each other.
|
||||||
pe_exec_values = [st.result_data.get("pe_exec_ns", 0.0) for st in sub_txns]
|
pe_exec_values = [st.result_data.get("pe_exec_ns", 0.0) for st in sub_txns]
|
||||||
if pe_exec_values:
|
if pe_exec_values:
|
||||||
txn.result_data["pe_exec_ns"] = max(pe_exec_values)
|
cur = txn.result_data.get("pe_exec_ns", 0.0) or 0.0
|
||||||
|
txn.result_data["pe_exec_ns"] = max(cur, max(pe_exec_values))
|
||||||
dma_values = [st.result_data.get("dma_ns", 0.0) for st in sub_txns]
|
dma_values = [st.result_data.get("dma_ns", 0.0) for st in sub_txns]
|
||||||
if dma_values:
|
if dma_values:
|
||||||
txn.result_data["dma_ns"] = max(dma_values)
|
cur = txn.result_data.get("dma_ns", 0.0) or 0.0
|
||||||
|
txn.result_data["dma_ns"] = max(cur, max(dma_values))
|
||||||
compute_values = [st.result_data.get("compute_ns", 0.0) for st in sub_txns]
|
compute_values = [st.result_data.get("compute_ns", 0.0) for st in sub_txns]
|
||||||
if compute_values:
|
if compute_values:
|
||||||
txn.result_data["compute_ns"] = max(compute_values)
|
cur = txn.result_data.get("compute_ns", 0.0) or 0.0
|
||||||
|
txn.result_data["compute_ns"] = max(cur, max(compute_values))
|
||||||
|
|
||||||
# Send aggregate response on reverse command path back to IO_CPU
|
# Send aggregate response on reverse command path back to IO_CPU
|
||||||
reverse_path = list(reversed(txn.path))
|
reverse_path = list(reversed(txn.path))
|
||||||
|
|||||||
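The merge-by-max aggregation in the M_CPU hunk above matters because several M_CPUs write into one shared `result_data` dict. A minimal sketch of that pattern, with `merge_max` as a hypothetical helper name that does not appear in the diff:

```python
def merge_max(result_data: dict, key: str, values: list[float]) -> None:
    """Fold per-PE values into a shared dict without clobbering other cubes."""
    if not values:
        return
    # `or 0.0` also covers a key that is present but holds None.
    cur = result_data.get(key, 0.0) or 0.0
    result_data[key] = max(cur, max(values))


shared: dict = {"dma_ns": None}
merge_max(shared, "pe_exec_ns", [120.0, 95.0])  # first cube reports
merge_max(shared, "pe_exec_ns", [80.0, 110.0])  # second cube must keep 120.0
merge_max(shared, "dma_ns", [7.5])
```

With plain assignment, the second cube would have overwritten the first cube's larger value with its own maximum.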
@@ -95,6 +95,13 @@ class PeCpuComponent(ComponentBase):
|
|||||||
request = txn.request
|
request = txn.request
|
||||||
yield from self.run(env, 0)
|
yield from self.run(env, 0)
|
||||||
|
|
||||||
|
# ADR-0009 D5: synchronized launch barrier. If IO_CPU or M_CPU stamped a
|
||||||
|
# target_start_ns, wait until then so every PE in this launch
|
||||||
|
# begins pe_exec measurement at the same simulated time.
|
||||||
|
target_start = getattr(request, "target_start_ns", None)
|
||||||
|
if target_start is not None and target_start > env.now:
|
||||||
|
yield env.timeout(float(target_start) - env.now)
|
||||||
|
|
||||||
kernel_fn = get_kernel(request.kernel_ref.name)
|
kernel_fn = get_kernel(request.kernel_ref.name)
|
||||||
num_programs = self._derive_num_programs(request)
|
num_programs = self._derive_num_programs(request)
|
||||||
kernel_args = self._unpack_kernel_args(request)
|
kernel_args = self._unpack_kernel_args(request)
|
||||||
|
|||||||
@@ -186,13 +186,37 @@ class PeDmaComponent(PeEngineBase):
|
|||||||
# ── IPCQ inbound (fabric → PE_DMA → MemoryStore + PE_IPCQ) ──────
|
# ── IPCQ inbound (fabric → PE_DMA → MemoryStore + PE_IPCQ) ──────
|
||||||
|
|
||||||
def _handle_ipcq_inbound(self, env: simpy.Environment, txn: Any) -> Generator:
|
def _handle_ipcq_inbound(self, env: simpy.Environment, txn: Any) -> Generator:
|
||||||
"""At destination PE_DMA: atomically write data and forward metadata.
|
"""At destination PE_DMA: pay terminal drain, then atomically write
|
||||||
|
data and forward metadata.
|
||||||
|
|
||||||
|
ADR-0023 D9 (drain at inbound terminal): the Transaction carries
|
||||||
|
``drain_ns = nbytes / bottleneck_bw_on_path`` stamped by the sender
|
||||||
|
PE_DMA. Like every other Transaction terminal in the simulator (see
|
||||||
|
``ComponentBase._forward_txn``), this drain must be paid when the
|
||||||
|
Transaction reaches its destination. SRC-side ``tl.send`` is
|
||||||
|
fire-and-forget — it never yields on ``sub_done`` — so paying the
|
||||||
|
drain here does NOT delay the sender. What it DOES delay is the
|
||||||
|
IpcqMetaArrival forwarded below: that delay is the only signal
|
||||||
|
``tl.recv`` on DST blocks on, which is exactly the desired
|
||||||
|
semantics — "send dispatches and returns; recv waits until the
|
||||||
|
bytes have actually landed in its inbox".
|
||||||
|
|
||||||
|
The drain MUST be paid before the atomic block — inserting a yield
|
||||||
|
inside would break invariant I6.
|
||||||
|
|
||||||
I6 (MUST): no SimPy yield between MemoryStore.write and the
|
I6 (MUST): no SimPy yield between MemoryStore.write and the
|
||||||
IpcqMetaArrival put into PE_IPCQ.
|
IpcqMetaArrival put into PE_IPCQ.
|
||||||
"""
|
"""
|
||||||
from kernbench.common.ipcq_types import IpcqMetaArrival
|
from kernbench.common.ipcq_types import IpcqMetaArrival
|
||||||
|
|
||||||
|
# Pay terminal BW drain before the atomic write/metadata forward.
|
||||||
|
# Without this, IPCQ effectively got fabric bandwidth for free at
|
||||||
|
# the terminal (only intermediate-hop overhead_ns was charged),
|
||||||
|
# making IPCQ latency lower than raw DMA at large sizes in benchmarks.
|
||||||
|
drain = getattr(txn, "drain_ns", 0.0)
|
||||||
|
if drain > 0:
|
||||||
|
yield env.timeout(drain)
|
||||||
|
|
||||||
token = txn.request
|
token = txn.request
|
||||||
|
|
||||||
# ── ATOMIC: do not introduce yield between these two operations ──
|
# ── ATOMIC: do not introduce yield between these two operations ──
|
||||||
|
|||||||
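The ordering constraint in the PE_DMA hunk above (pay the drain first, then do the write and the metadata forward with no yield in between) can be sketched without SimPy. This is an illustration under assumptions: bandwidth is taken to be in bytes per ns, `drain_ns` and `handle_inbound` are hypothetical names, and the yielded float merely stands in for `yield env.timeout(drain)`.

```python
import math
from typing import Generator


def drain_ns(nbytes: int, link_bw_bytes_per_ns: list[float]) -> float:
    """Drain time = nbytes / bottleneck bandwidth along the path."""
    min_bw = min(link_bw_bytes_per_ns, default=math.inf)
    if min_bw == math.inf:
        return 0.0
    return nbytes / min_bw


def handle_inbound(nbytes: int, bws: list[float], log: list[str]) -> Generator:
    drain = drain_ns(nbytes, bws)
    if drain > 0:
        yield drain  # the only suspension point, before the atomic block
    # Atomic block: no yield between the write and the metadata forward.
    log.append("memory_write")
    log.append("meta_forward")


events: list[str] = []
waits = list(handle_inbound(4096, [64.0, 32.0], events))
```

Because the generator suspends only before the atomic block, the receiver-side metadata is delayed by the drain while the write and the forward stay back to back.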
@@ -338,9 +338,13 @@ class PeIpcqComponent(ComponentBase):
|
|||||||
nbytes=req.result_data.get("nbytes", 0),
|
nbytes=req.result_data.get("nbytes", 0),
|
||||||
)
|
)
|
||||||
|
|
||||||
# Fast path credit return — bottleneck BW based latency
|
# Credit return: recv blocks on credit-emit so the protocol cost
|
||||||
env.process(
|
# (full path latency to deliver the credit metadata back to the
|
||||||
self._delayed_credit_send(env, direction, qp["peer_credit_store"], qp["my_tail"])
|
# sender) is reflected in the recv's pe_exec_ns. Models the IPCQ
|
||||||
|
# control-plane completing the consume-acknowledgement before
|
||||||
|
# recv returns to the kernel.
|
||||||
|
yield from self._delayed_credit_send(
|
||||||
|
env, direction, qp["peer_credit_store"], qp["my_tail"],
|
||||||
)
|
)
|
||||||
|
|
||||||
if not req.done.triggered:
|
if not req.done.triggered:
|
||||||
@@ -455,7 +459,12 @@ class PeIpcqComponent(ComponentBase):
|
|||||||
yield peer_credit_store.put(meta)
|
yield peer_credit_store.put(meta)
|
||||||
|
|
||||||
def _credit_latency_ns(self, direction: str) -> float:
|
def _credit_latency_ns(self, direction: str) -> float:
|
||||||
"""Compute credit fast path latency = credit_size / bottleneck_bw.
|
"""Full path latency for the credit-return packet.
|
||||||
|
|
||||||
|
Pays per-node overhead + edge prop + drain along the same fabric
|
||||||
|
the data took. PathRouter.find_path() auto-appends ".pe_dma" to
|
||||||
|
the source only, so the destination MUST be spelled with the
|
||||||
|
explicit ".pe_dma" suffix.
|
||||||
|
|
||||||
Falls back to 0 when ctx/router is unavailable (unit-test mode).
|
Falls back to 0 when ctx/router is unavailable (unit-test mode).
|
||||||
"""
|
"""
|
||||||
@@ -463,10 +472,12 @@ class PeIpcqComponent(ComponentBase):
|
|||||||
return 0.0
|
return 0.0
|
||||||
qp = self._queue_pairs[direction]
|
qp = self._queue_pairs[direction]
|
||||||
peer = qp["peer"]
|
peer = qp["peer"]
|
||||||
peer_pe_prefix = f"sip{peer.sip}.cube{peer.cube}.pe{peer.pe}"
|
peer_pe_dma = f"sip{peer.sip}.cube{peer.cube}.pe{peer.pe}.pe_dma"
|
||||||
try:
|
try:
|
||||||
path = self.ctx.router.find_path(self._pe_prefix, peer_pe_prefix)
|
path = self.ctx.router.find_path(self._pe_prefix, peer_pe_dma)
|
||||||
return self.ctx.compute_drain_ns(path, self._credit_size_bytes)
|
return self.ctx.compute_path_latency_ns(
|
||||||
|
path, self._credit_size_bytes,
|
||||||
|
)
|
||||||
except Exception:
|
except Exception:
|
||||||
return 0.0
|
return 0.0
|
||||||
|
|
||||||
|
|||||||
@@ -26,6 +26,9 @@ class ComponentContext:
|
|||||||
spec: dict = field(default_factory=dict) # topology spec (cube layout, PE count, etc.)
|
spec: dict = field(default_factory=dict) # topology spec (cube layout, PE count, etc.)
|
||||||
memory_store: Any = None # MemoryStore for Phase 1 data-aware execution (ADR-0020)
|
memory_store: Any = None # MemoryStore for Phase 1 data-aware execution (ADR-0020)
|
||||||
op_logger: Any = None # OpLogger for Phase 1 op recording (ADR-0020)
|
op_logger: Any = None # OpLogger for Phase 1 op recording (ADR-0020)
|
||||||
|
# node_id -> overhead_ns (ADR-0009 D5: used by IO_CPU and M_CPU to compute per-PE
|
||||||
|
# dispatch latency when stamping target_start_ns on KernelLaunchMsg).
|
||||||
|
node_overhead_ns: dict[str, float] = field(default_factory=dict)
|
||||||
|
|
||||||
def get_shared_resource(
|
def get_shared_resource(
|
||||||
self, env: simpy.Environment, key: str, capacity: int = 1,
|
self, env: simpy.Environment, key: str, capacity: int = 1,
|
||||||
@@ -52,3 +55,19 @@ class ComponentContext:
|
|||||||
if min_bw == float("inf"):
|
if min_bw == float("inf"):
|
||||||
return 0.0
|
return 0.0
|
||||||
return nbytes / min_bw
|
return nbytes / min_bw
|
||||||
|
|
||||||
|
def compute_path_latency_ns(self, path: list[str], nbytes: int = 0) -> float:
|
||||||
|
"""Formula latency along path: wire + per-node overhead + drain.
|
||||||
|
|
||||||
|
ADR-0009 D5: IO_CPU and M_CPU use this to compute per-PE dispatch latency
|
||||||
|
when stamping target_start_ns on KernelLaunchMsg fanout.
|
||||||
|
"""
|
||||||
|
total = 0.0
|
||||||
|
for i in range(len(path) - 1):
|
||||||
|
edge = self.edge_map.get((path[i], path[i + 1]))
|
||||||
|
if edge:
|
||||||
|
total += edge.distance_mm * self.ns_per_mm
|
||||||
|
for node_id in path:
|
||||||
|
total += self.node_overhead_ns.get(node_id, 0.0)
|
||||||
|
total += self.compute_drain_ns(path, nbytes)
|
||||||
|
return total
|
||||||
|
|||||||
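The formula in `compute_path_latency_ns` above (wire delay plus per-node overhead plus drain) can be checked by hand on a toy path. The sketch below is standalone and makes assumptions: `Edge` is a stand-in dataclass, the node names and numbers are invented, and the drain is passed in as a precomputed value rather than derived from bandwidth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    distance_mm: float


def path_latency_ns(
    path: list[str],
    edge_map: dict[tuple[str, str], Edge],
    node_overhead_ns: dict[str, float],
    ns_per_mm: float,
    drain_ns: float = 0.0,
) -> float:
    # Wire term: one contribution per consecutive pair that has an edge.
    wire = sum(
        edge_map[(a, b)].distance_mm * ns_per_mm
        for a, b in zip(path, path[1:])
        if (a, b) in edge_map
    )
    # Overhead term: every node on the path, both endpoints included.
    overhead = sum(node_overhead_ns.get(node, 0.0) for node in path)
    return wire + overhead + drain_ns


edges = {("io_cpu", "noc"): Edge(2.0), ("noc", "m_cpu"): Edge(3.0)}
overheads = {"io_cpu": 10.0, "noc": 1.0, "m_cpu": 5.0}
latency = path_latency_ns(["io_cpu", "noc", "m_cpu"], edges, overheads, 0.5)
```

Since both endpoints contribute overhead, chaining two such paths counts the shared middle node twice, which is why the IO_CPU hunk subtracts `m_overhead` from the two-leg sum.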
@@ -58,7 +58,13 @@ class IoCpuComponent(ComponentBase):
|
|||||||
self._pending[key] = (expected, received, parent_done)
|
self._pending[key] = (expected, received, parent_done)
|
||||||
|
|
||||||
def _dispatch_to_m_cpus(self, env: simpy.Environment, txn: Any) -> Generator:
|
def _dispatch_to_m_cpus(self, env: simpy.Environment, txn: Any) -> Generator:
|
||||||
"""Fan out sub-Transactions to target cube M_CPUs, wait for responses."""
|
"""Fan out sub-Transactions to target cube M_CPUs, wait for responses.
|
||||||
|
|
||||||
|
ADR-0009 D5 (extended): stamp a global target_start_ns on
|
||||||
|
KernelLaunchMsg so every PE across every target cube starts at
|
||||||
|
the same env.now. See the non-legacy builtin for full rationale.
|
||||||
|
"""
|
||||||
|
import dataclasses
|
||||||
from kernbench.runtime_api.kernel import KernelLaunchMsg, MemoryReadMsg, MemoryWriteMsg
|
from kernbench.runtime_api.kernel import KernelLaunchMsg, MemoryReadMsg, MemoryWriteMsg
|
||||||
|
|
||||||
request = txn.request
|
request = txn.request
|
||||||
@@ -72,10 +78,53 @@ class IoCpuComponent(ComponentBase):
|
|||||||
txn.done.succeed()
|
txn.done.succeed()
|
||||||
return
|
return
|
||||||
|
|
||||||
|
if isinstance(request, KernelLaunchMsg):
|
||||||
|
io_overhead = self.ctx.node_overhead_ns.get(self.node.id, 0.0)
|
||||||
|
global_max_latency = 0.0
|
||||||
|
pe_ids = self._resolve_pe_ids(
|
||||||
|
getattr(request, "target_pe", "all")
|
||||||
|
)
|
||||||
|
for sip, cube in cube_targets:
|
||||||
|
try:
|
||||||
|
m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
|
||||||
|
io_to_m_path = self.ctx.router.find_node_path(
|
||||||
|
self.node.id, m_cpu_id,
|
||||||
|
)
|
||||||
|
except Exception:
|
||||||
|
continue
|
||||||
|
if len(io_to_m_path) < 2:
|
||||||
|
continue
|
||||||
|
leg1 = self.ctx.compute_path_latency_ns(
|
||||||
|
io_to_m_path, nbytes=0,
|
||||||
|
)
|
||||||
|
m_overhead = self.ctx.node_overhead_ns.get(m_cpu_id, 0.0)
|
||||||
|
for pe_id in pe_ids:
|
||||||
|
pe_cpu_id = (
|
||||||
|
f"sip{sip}.cube{cube}.pe{pe_id}.pe_cpu"
|
||||||
|
)
|
||||||
|
try:
|
||||||
|
m_to_pe_path = self.ctx.router.find_node_path(
|
||||||
|
m_cpu_id, pe_cpu_id,
|
||||||
|
)
|
||||||
|
except Exception:
|
||||||
|
continue
|
||||||
|
if len(m_to_pe_path) < 2:
|
||||||
|
continue
|
||||||
|
leg2 = self.ctx.compute_path_latency_ns(
|
||||||
|
m_to_pe_path, nbytes=0,
|
||||||
|
)
|
||||||
|
latency = leg1 + leg2 - io_overhead - m_overhead
|
||||||
|
if latency > global_max_latency:
|
||||||
|
global_max_latency = latency
|
||||||
|
request = dataclasses.replace(
|
||||||
|
request,
|
||||||
|
target_start_ns=float(env.now) + global_max_latency,
|
||||||
|
)
|
||||||
|
|
||||||
# Setup aggregation
|
# Setup aggregation
|
||||||
self._pending[request.request_id] = (len(cube_targets), 0, txn.done)
|
self._pending[request.request_id] = (len(cube_targets), 0, txn.done)
|
||||||
|
|
||||||
# Fan out to each target cube's M_CPU
|
is_kernel_launch = isinstance(request, KernelLaunchMsg)
|
||||||
for sip, cube in cube_targets:
|
for sip, cube in cube_targets:
|
||||||
try:
|
try:
|
||||||
m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
|
m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
|
||||||
@@ -86,11 +135,24 @@ class IoCpuComponent(ComponentBase):
|
|||||||
continue
|
continue
|
||||||
sub_txn = Transaction(
|
sub_txn = Transaction(
|
||||||
request=request, path=path, step=0,
|
request=request, path=path, step=0,
|
||||||
nbytes=txn.nbytes, done=env.event(),
|
nbytes=0 if is_kernel_launch else txn.nbytes,
|
||||||
|
done=env.event(),
|
||||||
result_data=txn.result_data,
|
result_data=txn.result_data,
|
||||||
)
|
)
|
||||||
yield self.out_ports[path[1]].put(sub_txn.advance())
|
yield self.out_ports[path[1]].put(sub_txn.advance())
|
||||||
|
|
||||||
|
def _resolve_pe_ids(self, target_pe: Any) -> list[int]:
|
||||||
|
"""Resolve target_pe → list of PE indices (mirrors M_CPU logic)."""
|
||||||
|
if isinstance(target_pe, int):
|
||||||
|
return [target_pe]
|
||||||
|
if isinstance(target_pe, tuple):
|
||||||
|
return list(target_pe)
|
||||||
|
n_slices = 8
|
||||||
|
if self.ctx and self.ctx.spec:
|
||||||
|
mm = self.ctx.spec.get("cube", {}).get("memory_map", {})
|
||||||
|
n_slices = mm.get("hbm_slices_per_cube", 8)
|
||||||
|
return list(range(n_slices))
|
||||||
|
|
||||||
def _resolve_cube_targets(self, request: Any) -> list[tuple[int, int]]:
|
def _resolve_cube_targets(self, request: Any) -> list[tuple[int, int]]:
|
||||||
"""Return list of (sip, cube) pairs to fan out to."""
|
"""Return list of (sip, cube) pairs to fan out to."""
|
||||||
from kernbench.runtime_api.kernel import (
|
from kernbench.runtime_api.kernel import (
|
||||||
|
|||||||
@@ -162,7 +162,11 @@ class MCpuComponent(ComponentBase):
|
|||||||
Routes through find_node_path (M_CPU → NOC → PE_CPU command edges).
|
Routes through find_node_path (M_CPU → NOC → PE_CPU command edges).
|
||||||
PE_CPU sends ResponseMsg back via NOC → M_CPU on completion.
|
PE_CPU sends ResponseMsg back via NOC → M_CPU on completion.
|
||||||
Then sends aggregate ResponseMsg back to IO_CPU on the reverse path.
|
Then sends aggregate ResponseMsg back to IO_CPU on the reverse path.
|
||||||
|
|
||||||
|
ADR-0009 D5: passes through an IO_CPU-stamped target_start_ns, or
|
||||||
|
stamps a per-cube one if unset, so every PE starts at the same env.now.
|
||||||
"""
|
"""
|
||||||
|
import dataclasses
|
||||||
request = txn.request
|
request = txn.request
|
||||||
target_pe = getattr(request, "target_pe", "all")
|
target_pe = getattr(request, "target_pe", "all")
|
||||||
cube_prefix = self.node.id.rsplit(".", 1)[0] # e.g. "sip0.cube0"
|
cube_prefix = self.node.id.rsplit(".", 1)[0] # e.g. "sip0.cube0"
|
||||||
@@ -172,9 +176,10 @@ class MCpuComponent(ComponentBase):
|
|||||||
txn.done.succeed()
|
txn.done.succeed()
|
||||||
return
|
return
|
||||||
|
|
||||||
# Fan out to each PE_CPU, using response-based aggregation
|
# Resolve per-PE paths. If IO_CPU already stamped a global
|
||||||
sub_txns: list[Transaction] = []
|
# target_start_ns (ADR-0009 D5 extended), pass it through.
|
||||||
n_dispatched = 0
|
per_pe: list[tuple[int, list[str], float]] = []
|
||||||
|
max_latency = 0.0
|
||||||
for pe_id in pe_ids:
|
for pe_id in pe_ids:
|
||||||
pe_cpu_id = f"{cube_prefix}.pe{pe_id}.pe_cpu"
|
pe_cpu_id = f"{cube_prefix}.pe{pe_id}.pe_cpu"
|
||||||
try:
|
try:
|
||||||
@@ -183,8 +188,24 @@ class MCpuComponent(ComponentBase):
|
|||||||
continue
|
continue
|
||||||
if len(path) < 2:
|
if len(path) < 2:
|
||||||
continue
|
continue
|
||||||
|
latency = self.ctx.compute_path_latency_ns(path, nbytes=0)
|
||||||
|
per_pe.append((pe_id, path, latency))
|
||||||
|
if latency > max_latency:
|
||||||
|
max_latency = latency
|
||||||
|
|
||||||
|
if getattr(request, "target_start_ns", None) is not None:
|
||||||
|
stamped_request = request
|
||||||
|
else:
|
||||||
|
stamped_request = dataclasses.replace(
|
||||||
|
request, target_start_ns=float(env.now) + max_latency,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Fan out to each PE_CPU, using response-based aggregation
|
||||||
|
sub_txns: list[Transaction] = []
|
||||||
|
n_dispatched = 0
|
||||||
|
for pe_id, path, _lat in per_pe:
|
||||||
sub_txn = Transaction(
|
sub_txn = Transaction(
|
||||||
request=request, path=path, step=0,
|
request=stamped_request, path=path, step=0,
|
||||||
nbytes=0, done=env.event(),
|
nbytes=0, done=env.event(),
|
||||||
)
|
)
|
||||||
yield self.out_ports[path[1]].put(sub_txn.advance())
|
yield self.out_ports[path[1]].put(sub_txn.advance())
|
||||||
@@ -204,16 +225,21 @@ class MCpuComponent(ComponentBase):
|
|||||||
yield all_done
|
yield all_done
|
||||||
del self._parent_txns[request.request_id]
|
del self._parent_txns[request.request_id]
|
||||||
|
|
||||||
# Aggregate PE-internal metrics (max across PEs)
|
# Aggregate PE-internal metrics (max across PEs and across cubes).
|
||||||
|
# Multiple M_CPUs share the same result_data dict via IO_CPU fanout;
|
||||||
|
# merge against the existing value so cubes don't clobber each other.
|
||||||
pe_exec_values = [st.result_data.get("pe_exec_ns", 0.0) for st in sub_txns]
|
pe_exec_values = [st.result_data.get("pe_exec_ns", 0.0) for st in sub_txns]
|
||||||
if pe_exec_values:
|
if pe_exec_values:
|
||||||
txn.result_data["pe_exec_ns"] = max(pe_exec_values)
|
cur = txn.result_data.get("pe_exec_ns", 0.0) or 0.0
|
||||||
|
txn.result_data["pe_exec_ns"] = max(cur, max(pe_exec_values))
|
||||||
dma_values = [st.result_data.get("dma_ns", 0.0) for st in sub_txns]
|
dma_values = [st.result_data.get("dma_ns", 0.0) for st in sub_txns]
|
||||||
if dma_values:
|
if dma_values:
|
||||||
txn.result_data["dma_ns"] = max(dma_values)
|
cur = txn.result_data.get("dma_ns", 0.0) or 0.0
|
||||||
|
txn.result_data["dma_ns"] = max(cur, max(dma_values))
|
||||||
compute_values = [st.result_data.get("compute_ns", 0.0) for st in sub_txns]
|
compute_values = [st.result_data.get("compute_ns", 0.0) for st in sub_txns]
|
||||||
if compute_values:
|
if compute_values:
|
||||||
txn.result_data["compute_ns"] = max(compute_values)
|
cur = txn.result_data.get("compute_ns", 0.0) or 0.0
|
||||||
|
txn.result_data["compute_ns"] = max(cur, max(compute_values))
|
||||||
|
|
||||||
# Send aggregate response on reverse command path back to IO_CPU
|
# Send aggregate response on reverse command path back to IO_CPU
|
||||||
reverse_path = list(reversed(txn.path))
|
reverse_path = list(reversed(txn.path))
|
||||||
|
|||||||
@@ -71,6 +71,13 @@ class PeCpuComponent(ComponentBase):
|
|||||||
request = txn.request
|
request = txn.request
|
||||||
yield from self.run(env, 0)
|
yield from self.run(env, 0)
|
||||||
|
|
||||||
|
# ADR-0009 D5: synchronized launch barrier. If IO_CPU or M_CPU stamped a
|
||||||
|
# target_start_ns, wait until then so every PE in this launch
|
||||||
|
# begins pe_exec measurement at the same simulated time.
|
||||||
|
target_start = getattr(request, "target_start_ns", None)
|
||||||
|
if target_start is not None and target_start > env.now:
|
||||||
|
yield env.timeout(float(target_start) - env.now)
|
||||||
|
|
||||||
kernel_fn = get_kernel(request.kernel_ref.name)
|
kernel_fn = get_kernel(request.kernel_ref.name)
|
||||||
num_programs = self._derive_num_programs(request)
|
num_programs = self._derive_num_programs(request)
|
||||||
kernel_args = self._unpack_kernel_args(request)
|
kernel_args = self._unpack_kernel_args(request)
|
||||||
|
|||||||
@@ -19,7 +19,14 @@ class PageFault(Exception):
|
|||||||
|
|
||||||
|
|
||||||
class PeMMU:
|
class PeMMU:
|
||||||
"""Per-PE MMU with page-aligned VA→PA translation table.
|
"""Per-PE MMU with sub-page-capable VA→PA translation table.
|
||||||
|
|
||||||
|
Each page-table entry is a list of (start_in_page, end_in_page,
|
||||||
|
pa_at_offset_zero) regions. This is a SIMULATOR STOPGAP — real MMUs
|
||||||
|
store one PA per page-table entry. Sub-page regions exist here so
|
||||||
|
DPPolicy layouts that shard below page granularity (e.g. 128 B
|
||||||
|
payloads with 4 KB pages) don't silently mis-route through last-
|
||||||
|
write-wins overwrites. Memory note: project_mmu_subpage_stopgap.md.
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
page_size: Page size in bytes (default 2 MB).
|
page_size: Page size in bytes (default 2 MB).
|
||||||
@@ -34,7 +41,11 @@ class PeMMU:
|
|||||||
self._page_size = page_size
|
self._page_size = page_size
|
||||||
self._page_shift = (page_size - 1).bit_length()
|
self._page_shift = (page_size - 1).bit_length()
|
||||||
self._page_mask = page_size - 1
|
self._page_mask = page_size - 1
|
||||||
self._table: dict[int, int] = {} # va_page_number → pa_page_base
|
# vpn → list of (start_in_page, end_in_page, pa_at_offset_zero).
|
||||||
|
# pa_at_offset_zero is the PA that offset 0 of the page would map
|
||||||
|
# to under this region — i.e. translate(off) = pa_at_offset_zero
|
||||||
|
# + off when start <= off < end.
|
||||||
|
self._table: dict[int, list[tuple[int, int, int]]] = {}
|
||||||
self._overhead_ns = overhead_ns
|
self._overhead_ns = overhead_ns
|
||||||
|
|
||||||
@property
|
@property
|
||||||
@@ -46,21 +57,67 @@ class PeMMU:
|
|||||||
return len(self._table)
|
return len(self._table)
|
||||||
|
|
||||||
def map(self, va: int, pa: int, size: int) -> None:
|
def map(self, va: int, pa: int, size: int) -> None:
|
||||||
"""Register VA→PA mapping for a contiguous range."""
|
"""Register VA→PA mapping for a contiguous range.
|
||||||
for off in range(0, size, self._page_size):
|
|
||||||
vpn = (va + off) >> self._page_shift
|
Sub-page-aware: a single page can hold multiple disjoint regions,
|
||||||
self._table[vpn] = pa + off
|
each pointing to a different PA. Later map() calls APPEND a new
|
||||||
|
region; on overlap with an existing region, the new region wins
|
||||||
|
for the overlapping offsets (translate iterates in reverse so the
|
||||||
|
last write takes precedence — matches legacy single-PA behavior
|
||||||
|
when a full page is re-mapped).
|
||||||
|
"""
|
||||||
|
end_va = va + size
|
||||||
|
cur = va
|
||||||
|
while cur < end_va:
|
||||||
|
vpn = cur >> self._page_shift
|
||||||
|
page_base_va = vpn << self._page_shift
|
||||||
|
page_end_va = page_base_va + self._page_size
|
||||||
|
region_start = cur - page_base_va
|
||||||
|
region_end = min(end_va, page_end_va) - page_base_va
|
||||||
|
# PA seen at offset 0 of page if this region's mapping covered it
|
||||||
|
pa_at_offset_zero = pa + (cur - va) - region_start
|
||||||
|
self._table.setdefault(vpn, []).append(
|
||||||
|
(region_start, region_end, pa_at_offset_zero)
|
||||||
|
)
|
||||||
|
cur = page_base_va + region_end
|
||||||
|
|
||||||
def unmap(self, va: int, size: int) -> None:
|
def unmap(self, va: int, size: int) -> None:
|
||||||
"""Remove VA mapping for a contiguous range."""
|
"""Remove VA mapping for a contiguous range.
|
||||||
for off in range(0, size, self._page_size):
|
|
||||||
vpn = (va + off) >> self._page_shift
|
Drops any region whose extent is contained within the unmapped
|
||||||
self._table.pop(vpn, None)
|
range. Partial overlaps (region straddles the range boundary)
|
||||||
|
are left in place — caller is expected to unmap on the same
|
||||||
|
boundaries it mapped on.
|
||||||
|
"""
|
||||||
|
end_va = va + size
|
||||||
|
cur = va
|
||||||
|
while cur < end_va:
|
||||||
|
vpn = cur >> self._page_shift
|
||||||
|
page_base_va = vpn << self._page_shift
|
||||||
|
page_end_va = page_base_va + self._page_size
|
||||||
|
unmap_start = cur - page_base_va
|
||||||
|
unmap_end = min(end_va, page_end_va) - page_base_va
|
||||||
|
regions = self._table.get(vpn)
|
||||||
|
if regions is not None:
|
||||||
|
kept = [
|
||||||
|
r for r in regions
|
||||||
|
if not (r[0] >= unmap_start and r[1] <= unmap_end)
|
||||||
|
]
|
||||||
|
if kept:
|
||||||
|
self._table[vpn] = kept
|
||||||
|
else:
|
||||||
|
del self._table[vpn]
|
||||||
|
cur = page_base_va + unmap_end
|
||||||
|
|
||||||
def translate(self, va: int) -> int:
|
def translate(self, va: int) -> int:
|
||||||
"""Translate VA to PA. Raises PageFault if unmapped."""
|
"""Translate VA to PA. Raises PageFault if unmapped."""
|
||||||
vpn = va >> self._page_shift
|
vpn = va >> self._page_shift
|
||||||
pa_page_base = self._table.get(vpn)
|
regions = self._table.get(vpn)
|
||||||
if pa_page_base is None:
|
if regions is None:
|
||||||
|
raise PageFault(va)
|
||||||
|
offset = va & self._page_mask
|
||||||
|
# Iterate latest-first so newer map() calls win on overlap
|
||||||
|
for start, end, pa_at_offset_zero in reversed(regions):
|
||||||
|
if start <= offset < end:
|
||||||
|
return pa_at_offset_zero + offset
|
||||||
raise PageFault(va)
|
raise PageFault(va)
|
||||||
return pa_page_base + (va & self._page_mask)
|
|
||||||
|
|||||||
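The sub-page translation scheme in the PeMMU hunk above is easiest to follow with two regions sharing one page. The sketch below mirrors the `map`/`translate` logic from the diff in a standalone class. `SubPageTable` is a hypothetical name, and the 4 KB page size and the addresses are chosen only for illustration.

```python
class PageFault(Exception):
    pass


class SubPageTable:
    def __init__(self, page_size: int = 4096) -> None:
        self.page_size = page_size
        self.shift = (page_size - 1).bit_length()
        self.mask = page_size - 1
        # vpn -> list of (start_in_page, end_in_page, pa_at_offset_zero)
        self.table: dict[int, list[tuple[int, int, int]]] = {}

    def map(self, va: int, pa: int, size: int) -> None:
        end_va = va + size
        cur = va
        while cur < end_va:
            vpn = cur >> self.shift
            base = vpn << self.shift
            start = cur - base
            end = min(end_va, base + self.page_size) - base
            # PA that offset 0 of this page would have under this region.
            pa0 = pa + (cur - va) - start
            self.table.setdefault(vpn, []).append((start, end, pa0))
            cur = base + end

    def translate(self, va: int) -> int:
        offset = va & self.mask
        # Latest-first, so a newer map() wins on overlapping offsets.
        for start, end, pa0 in reversed(self.table.get(va >> self.shift, [])):
            if start <= offset < end:
                return pa0 + offset
        raise PageFault(va)


mmu = SubPageTable()
mmu.map(0x1000, 0x8000, 128)   # first 128 B shard of the page
mmu.map(0x1080, 0x20000, 128)  # second shard, different physical target
```

With a single PA per page, the second `map` call would have replaced the first, and accesses to the first shard would have been routed to the wrong physical address.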
@@ -90,6 +90,11 @@ class KernelLaunchMsg:
|
|||||||
args: tuple[KernelArg, ...]
|
args: tuple[KernelArg, ...]
|
||||||
target_cubes: tuple[int, ...] | Literal["all"] = "all"
|
target_cubes: tuple[int, ...] | Literal["all"] = "all"
|
||||||
target_pe: int | tuple[int, ...] | Literal["all"] = "all"
|
target_pe: int | tuple[int, ...] | Literal["all"] = "all"
|
||||||
|
# ADR-0009 D5: synchronized kernel start. When set, each PE_CPU yields
|
||||||
|
# until env.now >= target_start_ns before beginning kernel execution,
|
||||||
|
# so every PE in a launch starts at the same simulated time regardless
|
||||||
|
# of its M_CPU dispatch path length. Stamped by IO_CPU on fan-out to M_CPUs.
|
||||||
|
target_start_ns: float | None = None
|
||||||
msg_type: Literal["kernel_launch"] = "kernel_launch"
|
msg_type: Literal["kernel_launch"] = "kernel_launch"
|
||||||
|
|
||||||
|
|
||||||
|
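Note on `target_start_ns`: the arithmetic behind the D5 barrier can be sketched without the simulator. The function names `stamp_target_start_ns` and `barrier_delay_ns`, the PE labels, and the latency values below are hypothetical and chosen only for illustration; the real stamper derives latencies from the topology graph.

```python
from __future__ import annotations


def stamp_target_start_ns(now_ns: float, dispatch_latency_ns: dict[str, float]) -> float:
    """Pick a start time no earlier than the slowest PE's arrival."""
    return now_ns + max(dispatch_latency_ns.values())


def barrier_delay_ns(arrival_ns: float, target_start_ns: float | None) -> float:
    """How long a PE must wait after the launch message arrives.

    Returns 0.0 when no target is set, or when the PE arrived late
    (the case the D5 tests flag as a violation).
    """
    if target_start_ns is None:
        return 0.0
    return max(0.0, target_start_ns - arrival_ns)


# Illustrative dispatch latencies (ns) for three PEs on different paths.
latency = {"pe0": 10.0, "pe1": 25.0, "pe2": 40.0}
launch_now = 100.0
target = stamp_target_start_ns(launch_now, latency)

# Each PE arrives at launch_now + its latency, then waits out the remainder.
starts = {
    pe: (launch_now + lat) + barrier_delay_ns(launch_now + lat, target)
    for pe, lat in latency.items()
}
```

If the predicted maximum undershoots the real latency of any PE, that PE's delay clamps to zero and its start time drifts past the target, which is the failure mode the new D5 tests in this change are written to catch.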
|||||||
@@ -67,6 +67,10 @@ class GraphEngine:
|
|||||||
spec=graph.spec,
|
spec=graph.spec,
|
||||||
memory_store=self._memory_store,
|
memory_store=self._memory_store,
|
||||||
op_logger=self._op_logger,
|
op_logger=self._op_logger,
|
||||||
|
node_overhead_ns={
|
||||||
|
nid: float(n.attrs.get("overhead_ns", 0.0))
|
||||||
|
for nid, n in graph.nodes.items()
|
||||||
|
},
|
||||||
)
|
)
|
||||||
self._components: dict[str, ComponentBase] = {
|
self._components: dict[str, ComponentBase] = {
|
||||||
node_id: ComponentRegistry.create(node, overrides, ctx)
|
node_id: ComponentRegistry.create(node, overrides, ctx)
|
||||||
|
|||||||
|
@@ -0,0 +1,34 @@
|
|||||||
|
algorithm,sip_topology,n_sips,n_elem,bytes_per_pe,bytes_per_sip,latency_ns
|
||||||
|
intercube_allreduce,ring_1d,6,8,16,256,3073.1299999999937
|
||||||
|
intercube_allreduce,ring_1d,6,32,64,1024,3079.8799999999947
|
||||||
|
intercube_allreduce,ring_1d,6,64,128,2048,3088.879999999992
|
||||||
|
intercube_allreduce,ring_1d,6,128,256,4096,3106.8799999999865
|
||||||
|
intercube_allreduce,ring_1d,6,512,1024,16384,3225.8799999999865
|
||||||
|
intercube_allreduce,ring_1d,6,1024,2048,32768,3391.8799999999865
|
||||||
|
intercube_allreduce,ring_1d,6,2048,4096,65536,3723.8799999999865
|
||||||
|
intercube_allreduce,ring_1d,6,4096,8192,131072,4387.879999999965
|
||||||
|
intercube_allreduce,ring_1d,6,8192,16384,262144,5715.879999999957
|
||||||
|
intercube_allreduce,ring_1d,6,16384,32768,524288,8371.879999999932
|
||||||
|
intercube_allreduce,ring_1d,6,32768,65536,1048576,13683.879999999903
|
||||||
|
intercube_allreduce,torus_2d,6,8,16,256,2190.4799999999923
|
||||||
|
intercube_allreduce,torus_2d,6,32,64,1024,2196.479999999993
|
||||||
|
intercube_allreduce,torus_2d,6,64,128,2048,2204.4799999999905
|
||||||
|
intercube_allreduce,torus_2d,6,128,256,4096,2220.479999999985
|
||||||
|
intercube_allreduce,torus_2d,6,512,1024,16384,2325.479999999985
|
||||||
|
intercube_allreduce,torus_2d,6,1024,2048,32768,2471.479999999985
|
||||||
|
intercube_allreduce,torus_2d,6,2048,4096,65536,2763.479999999985
|
||||||
|
intercube_allreduce,torus_2d,6,4096,8192,131072,3347.4799999999777
|
||||||
|
intercube_allreduce,torus_2d,6,8192,16384,262144,4515.4799999999705
|
||||||
|
intercube_allreduce,torus_2d,6,16384,32768,524288,6851.479999999952
|
||||||
|
intercube_allreduce,torus_2d,6,32768,65536,1048576,11523.479999999923
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,8,16,256,3508.4249999999993
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,32,64,1024,3515.55
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,64,128,2048,3525.0499999999975
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,128,256,4096,3544.049999999992
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,512,1024,16384,3667.049999999992
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3837.049999999992
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,4177.049999999992
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,4857.049999999959
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,6217.049999999945
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,8937.049999999937
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,14377.049999999872
|
||||||
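Note on the sweep results above: one way to read the table is to normalize each topology's latency against `ring_1d` at the same message size. The sketch below does that for the smallest and largest sizes only; the rows are copied from the CSV with latencies rounded to two decimals, and the helper name `ratio_vs_ring` is an assumption for illustration.

```python
# (sip_topology, bytes_per_pe) -> latency_ns, rounded from the CSV rows above.
LATENCY_NS = {
    ("ring_1d", 16): 3073.13,
    ("ring_1d", 65536): 13683.88,
    ("torus_2d", 16): 2190.48,
    ("torus_2d", 65536): 11523.48,
    ("mesh_2d_no_wrap", 16): 3508.43,
    ("mesh_2d_no_wrap", 65536): 14377.05,
}


def ratio_vs_ring(topology: str, bytes_per_pe: int) -> float:
    """Latency of `topology` divided by ring_1d latency at the same size."""
    return LATENCY_NS[(topology, bytes_per_pe)] / LATENCY_NS[("ring_1d", bytes_per_pe)]


for topo in ("torus_2d", "mesh_2d_no_wrap"):
    for size in (16, 65536):
        print(f"{topo:<16} {size:>6} B/PE: {ratio_vs_ring(topo, size):.3f}x ring_1d")
```

On these rows the torus stays below the ring at both ends of the sweep and the no-wrap mesh stays above it, with both ratios moving toward 1.0 as the payload grows.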
|
|
@@ -0,0 +1,91 @@
|
|||||||
|
hop,label,size_bytes,path,total_ns
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,ipcq,31.1399999999976
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,raw,12.019999999996799
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,ipcq,32.6399999999976
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,raw,13.019999999996799
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,ipcq,34.1399999999976
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,raw,14.019999999996799
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,ipcq,35.6399999999976
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,raw,15.019999999996799
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,ipcq,38.6399999999976
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,raw,17.0199999999968
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,ipcq,41.6399999999976
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,raw,19.0199999999968
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,ipcq,53.6399999999976
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,raw,27.0199999999968
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,ipcq,77.6399999999976
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,raw,43.0199999999968
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,ipcq,125.64000000000306
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,raw,75.02000000000407
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,ipcq,149.64000000000306
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,raw,91.02000000000407
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,ipcq,31.1399999999976
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,raw,12.019999999996799
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,ipcq,32.6399999999976
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,raw,13.019999999996799
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,ipcq,34.1399999999976
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,raw,14.019999999996799
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,ipcq,35.6399999999976
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,raw,15.019999999996799
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,ipcq,38.6399999999976
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,raw,17.0199999999968
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,ipcq,41.6399999999976
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,raw,19.0199999999968
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,ipcq,53.6399999999976
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,raw,27.0199999999968
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,ipcq,77.6399999999976
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,raw,43.0199999999968
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,ipcq,125.64000000000306
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,raw,75.02000000000407
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,ipcq,149.64000000000306
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,raw,91.02000000000407
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,ipcq,67.15999999999804
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,raw,68.53999999999724
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,ipcq,68.65999999999804
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,raw,70.03999999999724
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,ipcq,70.15999999999804
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,raw,71.53999999999724
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,ipcq,71.65999999999804
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,raw,73.03999999999724
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,ipcq,74.65999999999804
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,raw,76.03999999999724
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,ipcq,77.65999999999804
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,raw,79.03999999999724
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,ipcq,89.65999999999804
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,raw,91.03999999999724
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,ipcq,113.65999999999804
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,raw,115.03999999999724
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,ipcq,161.65999999999985
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,raw,163.04000000000087
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,ipcq,185.65999999999985
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,raw,187.04000000000087
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,ipcq,87.15999999999804
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,raw,88.53999999999724
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,ipcq,88.65999999999804
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,raw,90.03999999999724
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,ipcq,90.15999999999804
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,raw,91.53999999999724
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,ipcq,91.65999999999804
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,raw,93.03999999999724
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,ipcq,94.65999999999804
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,raw,96.03999999999724
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,ipcq,97.65999999999804
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,raw,99.03999999999724
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,ipcq,109.65999999999804
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,raw,111.03999999999724
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,ipcq,133.65999999999804
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,raw,135.03999999999724
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,ipcq,181.65999999999985
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,raw,183.04000000000087
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,ipcq,205.65999999999985
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,raw,207.04000000000087
|
||||||
|
h5_inter_sip,"Inter-SIP (sip0 to sip1, same cube/pe)",128,ipcq,6.015000000003056
|
||||||
|
h5_inter_sip,"Inter-SIP (sip0 to sip1, same cube/pe)",256,ipcq,6.515000000003056
|
||||||
|
h5_inter_sip,"Inter-SIP (sip0 to sip1, same cube/pe)",384,ipcq,7.015000000003056
|
||||||
|
h5_inter_sip,"Inter-SIP (sip0 to sip1, same cube/pe)",512,ipcq,7.515000000003056
|
||||||
|
h5_inter_sip,"Inter-SIP (sip0 to sip1, same cube/pe)",768,ipcq,8.515000000003056
|
||||||
|
h5_inter_sip,"Inter-SIP (sip0 to sip1, same cube/pe)",1024,ipcq,9.515000000003056
|
||||||
|
h5_inter_sip,"Inter-SIP (sip0 to sip1, same cube/pe)",2048,ipcq,13.515000000003056
|
||||||
|
h5_inter_sip,"Inter-SIP (sip0 to sip1, same cube/pe)",4096,ipcq,21.515000000003056
|
||||||
|
h5_inter_sip,"Inter-SIP (sip0 to sip1, same cube/pe)",8192,ipcq,37.51499999999214
|
||||||
|
h5_inter_sip,"Inter-SIP (sip0 to sip1, same cube/pe)",10240,ipcq,45.51499999999214
|
||||||
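Note on the per-hop table above: within each hop/path series the latency grows linearly with payload size, so a two-point fit is enough to split it into a fixed cost and a per-byte cost. The sketch below uses the 128 B and 10240 B rows (rounded to three decimals); the function name `two_point_fit` and the dictionary layout are assumptions for illustration, not part of the benchmark code.

```python
# (hop, path) -> ((size_bytes, total_ns) at 128 B, (size_bytes, total_ns) at 10240 B)
ENDPOINTS = {
    ("h1_intra_horizontal", "ipcq"): ((128, 31.140), (10240, 149.640)),
    ("h1_intra_horizontal", "raw"): ((128, 12.020), (10240, 91.020)),
    ("h3_inter_cube_horizontal", "ipcq"): ((128, 67.160), (10240, 185.660)),
    ("h3_inter_cube_horizontal", "raw"): ((128, 68.540), (10240, 187.040)),
    ("h5_inter_sip", "ipcq"): ((128, 6.015), (10240, 45.515)),
}


def two_point_fit(p0: tuple[int, float], p1: tuple[int, float]) -> tuple[float, float]:
    """Return (fixed_ns, ns_per_byte) for the line through two (size, latency) points."""
    (s0, t0), (s1, t1) = p0, p1
    ns_per_byte = (t1 - t0) / (s1 - s0)
    fixed_ns = t0 - ns_per_byte * s0
    return fixed_ns, ns_per_byte


FITS = {key: two_point_fit(*pts) for key, pts in ENDPOINTS.items()}

for (hop, path), (fixed_ns, ns_per_byte) in FITS.items():
    print(f"{hop:<26} {path:<4} fixed={fixed_ns:7.3f} ns  slope={ns_per_byte:.6f} ns/B")
```

Reading the fits side by side separates topology effects (the fixed term) from link-rate effects (the slope), which is harder to see in the raw rows.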
|
@@ -22,13 +22,23 @@ from kernbench.ccl.sfr_config import configure_sfr_intercube_multisip
|
|||||||
from kernbench.policy.placement.dp import DPPolicy
|
from kernbench.policy.placement.dp import DPPolicy
|
||||||
|
|
||||||
|
|
||||||
def _sip_topo_dims(sip_topo: str, n_sips: int) -> tuple[int, int]:
|
def _sip_topo_dims(
|
||||||
|
sip_topo: str, n_sips: int,
|
||||||
|
spec_w: int | None = None, spec_h: int | None = None,
|
||||||
|
) -> tuple[int, int]:
|
||||||
if sip_topo == "ring_1d":
|
if sip_topo == "ring_1d":
|
||||||
return (0, 0)
|
return (0, 0)
|
||||||
|
if spec_w is not None and spec_h is not None:
|
||||||
|
if spec_w * spec_h != n_sips:
|
||||||
|
raise ValueError(
|
||||||
|
f"sip layout {spec_w}x{spec_h} != n_sips ({n_sips})"
|
||||||
|
)
|
||||||
|
return (spec_w, spec_h)
|
||||||
side = int(round(math.sqrt(n_sips)))
|
side = int(round(math.sqrt(n_sips)))
|
||||||
if side * side != n_sips:
|
if side * side != n_sips:
|
||||||
raise ValueError(
|
raise ValueError(
|
||||||
f"SIP topology '{sip_topo}' requires square n_sips, got {n_sips}"
|
f"SIP topology '{sip_topo}' requires square n_sips or "
|
||||||
|
f"explicit w/h in spec, got {n_sips}"
|
||||||
)
|
)
|
||||||
return (side, side)
|
return (side, side)
|
||||||
|
|
||||||
@@ -54,10 +64,13 @@ def run_allreduce(
|
|||||||
topo_name_to_kind = algo_module.TOPO_NAME_TO_KIND
|
topo_name_to_kind = algo_module.TOPO_NAME_TO_KIND
|
||||||
|
|
||||||
n_elem = int(cfg.get("n_elem", 8))
|
n_elem = int(cfg.get("n_elem", 8))
|
||||||
n_sips = int(spec.get("system", {}).get("sips", {}).get("count", 1))
|
sips_cfg = spec.get("system", {}).get("sips", {})
|
||||||
sip_topo = str(
|
n_sips = int(sips_cfg.get("count", 1))
|
||||||
spec.get("system", {}).get("sips", {}).get("topology", "ring_1d")
|
sip_topo = str(sips_cfg.get("topology", "ring_1d"))
|
||||||
)
|
spec_sip_w = sips_cfg.get("w")
|
||||||
|
spec_sip_h = sips_cfg.get("h")
|
||||||
|
spec_sip_w = int(spec_sip_w) if spec_sip_w is not None else None
|
||||||
|
spec_sip_h = int(spec_sip_h) if spec_sip_h is not None else None
|
||||||
|
|
||||||
cm = spec["sip"]["cube_mesh"]
|
cm = spec["sip"]["cube_mesh"]
|
||||||
cube_w = int(cm["w"])
|
cube_w = int(cm["w"])
|
||||||
@@ -65,7 +78,9 @@ def run_allreduce(
|
|||||||
n_cubes = cube_w * cube_h
|
n_cubes = cube_w * cube_h
|
||||||
|
|
||||||
sip_topo_kind = topo_name_to_kind.get(sip_topo, 0)
|
sip_topo_kind = topo_name_to_kind.get(sip_topo, 0)
|
||||||
sip_topo_w, sip_topo_h = _sip_topo_dims(sip_topo, n_sips)
|
sip_topo_w, sip_topo_h = _sip_topo_dims(
|
||||||
|
sip_topo, n_sips, spec_w=spec_sip_w, spec_h=spec_sip_h,
|
||||||
|
)
|
||||||
|
|
||||||
algo_name = cfg.get("algorithm", "allreduce")
|
algo_name = cfg.get("algorithm", "allreduce")
|
||||||
print(f"\n{'=' * 60}")
|
print(f"\n{'=' * 60}")
|
||||||
@@ -173,18 +188,36 @@ from kernbench.topology.builder import resolve_topology
|
|||||||
TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
|
TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
|
||||||
|
|
||||||
CONFIGS = [
|
CONFIGS = [
|
||||||
pytest.param("intercube_allreduce", "ring_1d", 2, id="ring_2sip"),
|
pytest.param(
|
||||||
pytest.param("intercube_allreduce", "torus_2d", 4, id="torus_4sip"),
|
"intercube_allreduce", "ring_1d", 6, None, None,
|
||||||
pytest.param("intercube_allreduce", "mesh_2d_no_wrap", 4, id="mesh_4sip"),
|
id="ring_6sip",
|
||||||
|
),
|
||||||
|
pytest.param(
|
||||||
|
"intercube_allreduce", "torus_2d", 6, 2, 3,
|
||||||
|
id="torus_6sip_2x3",
|
||||||
|
),
|
||||||
|
pytest.param(
|
||||||
|
"intercube_allreduce", "mesh_2d_no_wrap", 6, 2, 3,
|
||||||
|
id="mesh_6sip_2x3",
|
||||||
|
),
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
def _write_temp_configs(tmp_path, sip_topology, n_sips, algorithm):
|
def _write_temp_configs(
|
||||||
|
tmp_path, sip_topology, n_sips, algorithm, n_elem_override=None,
|
||||||
|
sip_w=None, sip_h=None,
|
||||||
|
):
|
||||||
"""Write temp topology.yaml and ccl.yaml with the given overrides."""
|
"""Write temp topology.yaml and ccl.yaml with the given overrides."""
|
||||||
with open(TOPOLOGY_PATH) as f:
|
with open(TOPOLOGY_PATH) as f:
|
||||||
topo_cfg = yaml.safe_load(f)
|
topo_cfg = yaml.safe_load(f)
|
||||||
topo_cfg["system"]["sips"]["count"] = n_sips
|
topo_cfg["system"]["sips"]["count"] = n_sips
|
||||||
topo_cfg["system"]["sips"]["topology"] = sip_topology
|
topo_cfg["system"]["sips"]["topology"] = sip_topology
|
||||||
|
if sip_w is not None and sip_h is not None:
|
||||||
|
topo_cfg["system"]["sips"]["w"] = int(sip_w)
|
||||||
|
topo_cfg["system"]["sips"]["h"] = int(sip_h)
|
||||||
|
else:
|
||||||
|
topo_cfg["system"]["sips"].pop("w", None)
|
||||||
|
topo_cfg["system"]["sips"].pop("h", None)
|
||||||
topo_path = tmp_path / "topology.yaml"
|
topo_path = tmp_path / "topology.yaml"
|
||||||
with open(topo_path, "w") as f:
|
with open(topo_path, "w") as f:
|
||||||
yaml.dump(topo_cfg, f, default_flow_style=False)
|
yaml.dump(topo_cfg, f, default_flow_style=False)
|
||||||
@@ -193,6 +226,15 @@ def _write_temp_configs(tmp_path, sip_topology, n_sips, algorithm):
|
|||||||
with open(ccl_path) as f:
|
with open(ccl_path) as f:
|
||||||
ccl_cfg = yaml.safe_load(f)
|
ccl_cfg = yaml.safe_load(f)
|
||||||
ccl_cfg["defaults"]["algorithm"] = algorithm
|
ccl_cfg["defaults"]["algorithm"] = algorithm
|
||||||
|
if n_elem_override is not None:
|
||||||
|
ccl_cfg.setdefault("algorithms", {}).setdefault(
|
||||||
|
algorithm, {},
|
||||||
|
)["n_elem"] = int(n_elem_override)
|
||||||
|
# Ensure IPCQ slot is big enough for the per-message payload.
|
||||||
|
per_msg_bytes = int(n_elem_override) * 2 # f16
|
||||||
|
default_slot = int(ccl_cfg["defaults"].get("slot_size", 4096))
|
||||||
|
if per_msg_bytes > default_slot:
|
||||||
|
ccl_cfg["defaults"]["slot_size"] = per_msg_bytes
|
||||||
tmp_ccl = tmp_path / "ccl.yaml"
|
tmp_ccl = tmp_path / "ccl.yaml"
|
||||||
with open(tmp_ccl, "w") as f:
|
with open(tmp_ccl, "w") as f:
|
||||||
yaml.dump(ccl_cfg, f, default_flow_style=False)
|
yaml.dump(ccl_cfg, f, default_flow_style=False)
|
||||||
@@ -200,10 +242,15 @@ def _write_temp_configs(tmp_path, sip_topology, n_sips, algorithm):
|
|||||||
return str(topo_path), str(tmp_ccl)
|
return str(topo_path), str(tmp_ccl)
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize("algorithm,sip_topology,n_sips", CONFIGS)
|
@pytest.mark.parametrize(
|
||||||
def test_allreduce(tmp_path, algorithm, sip_topology, n_sips):
|
"algorithm,sip_topology,n_sips,sip_w,sip_h", CONFIGS,
|
||||||
|
)
|
||||||
|
def test_allreduce(
|
||||||
|
tmp_path, algorithm, sip_topology, n_sips, sip_w, sip_h,
|
||||||
|
):
|
||||||
topo_path, ccl_path = _write_temp_configs(
|
topo_path, ccl_path = _write_temp_configs(
|
||||||
tmp_path, sip_topology, n_sips, algorithm,
|
tmp_path, sip_topology, n_sips, algorithm,
|
||||||
|
sip_w=sip_w, sip_h=sip_h,
|
||||||
)
|
)
|
||||||
topo = resolve_topology(topo_path)
|
topo = resolve_topology(topo_path)
|
||||||
engine = GraphEngine(topo.topology_obj, enable_data=True)
|
engine = GraphEngine(topo.topology_obj, enable_data=True)
|
||||||
@@ -220,3 +267,163 @@ def test_allreduce(tmp_path, algorithm, sip_topology, n_sips):
|
|||||||
algorithm=algorithm, ccl_yaml=ccl_path,
|
algorithm=algorithm, ccl_yaml=ccl_path,
|
||||||
)
|
)
|
||||||
assert result["ok_cubes"] > 0
|
assert result["ok_cubes"] > 0
|
||||||
|
|
||||||
|
|
||||||
|
# ── Latency sweep ─────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
# Sweep sizes skip n_elem == 16 (== n_cubes, dim_map collision) and go up to 1 MB per SIP:
|
||||||
|
# bytes_per_sip = n_cubes * n_elem * 2 = 32 * n_elem.
|
||||||
|
_SWEEP_N_ELEM = [
|
||||||
|
8, 32, 64, 128, 512, 1024, 2048,
|
||||||
|
4096, 8192, 16384, 32768,
|
||||||
|
]
|
||||||
|
_ELEM_BYTES_F16 = 2
|
||||||
|
|
||||||
|
|
||||||
|
def test_allreduce_latency_sweep(tmp_path):
|
||||||
|
"""Sweep n_elem across each SIP topology; record max(pe_exec_ns)
|
||||||
|
as the critical-path kernel latency. Emits CSV + PNG plots to
|
||||||
|
tests/allreduce_latency_plots/.
|
||||||
|
"""
|
||||||
|
import csv
|
||||||
|
|
||||||
|
import matplotlib.pyplot as plt
|
||||||
|
from matplotlib.ticker import FuncFormatter
|
||||||
|
|
||||||
|
def _fmt_bytes(x, _pos):
|
||||||
|
"""Format tick as B / KB / MB."""
|
||||||
|
if x <= 0:
|
||||||
|
return "0"
|
||||||
|
if x >= 1024 * 1024:
|
||||||
|
return f"{x / (1024 * 1024):.0f} MB"
|
||||||
|
if x >= 1024:
|
||||||
|
return f"{x / 1024:.0f} KB"
|
||||||
|
return f"{x:.0f} B"
|
||||||
|
|
||||||
|
_bytes_fmt = FuncFormatter(_fmt_bytes)
|
||||||
|
|
||||||
|
out_dir = Path(__file__).parent / "allreduce_latency_plots"
|
||||||
|
out_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
records: list[dict] = []
|
||||||
|
|
||||||
|
# Apples-to-apples: same n_sips across all three topologies.
|
||||||
|
for algorithm, sip_topology, n_sips, sip_w, sip_h in [
|
||||||
|
("intercube_allreduce", "ring_1d", 6, None, None),
|
||||||
|
("intercube_allreduce", "torus_2d", 6, 2, 3),
|
||||||
|
("intercube_allreduce", "mesh_2d_no_wrap", 6, 2, 3),
|
||||||
|
]:
|
||||||
|
for n_elem in _SWEEP_N_ELEM:
|
||||||
|
sub = tmp_path / f"{sip_topology}_{n_elem}"
|
||||||
|
sub.mkdir()
|
||||||
|
topo_path, ccl_path = _write_temp_configs(
|
||||||
|
sub, sip_topology, n_sips, algorithm,
|
||||||
|
sip_w=sip_w, sip_h=sip_h,
|
||||||
|
n_elem_override=n_elem,
|
||||||
|
)
|
||||||
|
topo = resolve_topology(topo_path)
|
||||||
|
engine = GraphEngine(topo.topology_obj, enable_data=True)
|
||||||
|
spec = topo.topology_obj.spec
|
||||||
|
|
||||||
|
with RuntimeContext(
|
||||||
|
engine=engine,
|
||||||
|
target_device=DeviceSelector("all"),
|
||||||
|
correlation_id=f"sweep_{algorithm}_{sip_topology}_{n_elem}",
|
||||||
|
spec=spec,
|
||||||
|
) as ctx:
|
||||||
|
result = run_allreduce(
|
||||||
|
ctx, engine, spec,
|
||||||
|
algorithm=algorithm, ccl_yaml=ccl_path,
|
||||||
|
)
|
||||||
|
assert result["ok_cubes"] > 0
|
||||||
|
|
||||||
|
pe_exec_vals = [
|
||||||
|
float(tr.get("pe_exec_ns", 0.0) or 0.0)
|
||||||
|
for _, (_, tr) in engine._results.items()
|
||||||
|
if isinstance(tr, dict)
|
||||||
|
]
|
||||||
|
crit_ns = max(pe_exec_vals) if pe_exec_vals else 0.0
|
||||||
|
|
||||||
|
cm = spec["sip"]["cube_mesh"]
|
||||||
|
n_cubes = int(cm["w"]) * int(cm["h"])
|
||||||
|
bytes_per_sip = n_cubes * n_elem * _ELEM_BYTES_F16
|
||||||
|
# pe="replicate" + num_pes=1 → one active PE per cube owns
|
||||||
|
# the whole cube row. Per-PE bytes == per-cube-tile bytes ==
|
||||||
|
# per-message bytes over the IPCQ fabric.
|
||||||
|
bytes_per_pe = n_elem * _ELEM_BYTES_F16
|
||||||
|
|
||||||
|
records.append({
|
||||||
|
"algorithm": algorithm,
|
||||||
|
"sip_topology": sip_topology,
|
||||||
|
"n_sips": n_sips,
|
||||||
|
"n_elem": n_elem,
|
||||||
|
"bytes_per_pe": bytes_per_pe,
|
||||||
|
"bytes_per_sip": bytes_per_sip,
|
||||||
|
"latency_ns": crit_ns,
|
||||||
|
})
|
||||||
|
print(
|
||||||
|
f"[{sip_topology:<16} n_sips={n_sips} n_elem={n_elem:>5} "
|
||||||
|
f"bytes/pe={bytes_per_pe:>7} bytes/sip={bytes_per_sip:>9}] "
|
||||||
|
f"pe_exec_max = {crit_ns:8.1f} ns"
|
||||||
|
)
|
||||||
|
|
||||||
|
with open(out_dir / "summary.csv", "w", newline="", encoding="utf-8") as f:
|
||||||
|
w = csv.DictWriter(f, fieldnames=[
|
||||||
|
"algorithm", "sip_topology", "n_sips", "n_elem",
|
||||||
|
"bytes_per_pe", "bytes_per_sip", "latency_ns",
|
||||||
|
])
|
||||||
|
w.writeheader()
|
||||||
|
for r in records:
|
||||||
|
w.writerow(r)
|
||||||
|
|
||||||
|
topologies = sorted({r["sip_topology"] for r in records})
|
||||||
|
# Per-topology plots, log-scale x-axis = bytes per PE.
|
||||||
|
for topo_name in topologies:
|
||||||
|
rs = sorted(
|
||||||
|
[r for r in records if r["sip_topology"] == topo_name],
|
||||||
|
key=lambda r: r["bytes_per_pe"],
|
||||||
|
)
|
||||||
|
xs = [r["bytes_per_pe"] for r in rs]
|
||||||
|
ys = [r["latency_ns"] for r in rs]
|
||||||
|
title = (
|
||||||
|
f"Allreduce latency — {topo_name} "
|
||||||
|
f"(n_sips={rs[0]['n_sips']})"
|
||||||
|
)
|
||||||
|
fig, ax = plt.subplots(figsize=(8, 5))
|
||||||
|
ax.plot(xs, ys, marker="o", color="tab:blue")
|
||||||
|
ax.set_xscale("log", base=2)
|
||||||
|
ax.set_xlabel("Bytes per PE (log scale)")
|
||||||
|
ax.set_ylabel("max pe_exec_ns (critical path)")
|
||||||
|
ax.set_title(title)
|
||||||
|
ax.grid(True, alpha=0.3)
|
||||||
|
ax.xaxis.set_major_formatter(_bytes_fmt)
|
||||||
|
fig.tight_layout()
|
||||||
|
fig.savefig(out_dir / f"{topo_name}.png", dpi=120)
|
||||||
|
plt.close(fig)
|
||||||
|
|
||||||
|
colors = {"ring_1d": "tab:blue", "torus_2d": "tab:orange",
|
||||||
|
"mesh_2d_no_wrap": "tab:green"}
|
||||||
|
fig, ax = plt.subplots(figsize=(9, 6))
|
||||||
|
for topo_name in topologies:
|
||||||
|
rs = sorted(
|
||||||
|
[r for r in records if r["sip_topology"] == topo_name],
|
||||||
|
key=lambda r: r["bytes_per_pe"],
|
||||||
|
)
|
||||||
|
ax.plot(
|
||||||
|
[r["bytes_per_pe"] for r in rs],
|
||||||
|
[r["latency_ns"] for r in rs],
|
||||||
|
marker="o",
|
||||||
|
label=f"{topo_name} (n_sips={rs[0]['n_sips']})",
|
||||||
|
color=colors.get(topo_name),
|
||||||
|
)
|
||||||
|
ax.set_xscale("log", base=2)
|
||||||
|
ax.set_xlabel("Bytes per PE (log scale)")
|
||||||
|
ax.set_ylabel("max pe_exec_ns (critical path)")
|
||||||
|
ax.set_title("Multi-device allreduce latency by topology")
|
||||||
|
ax.grid(True, alpha=0.3)
|
||||||
|
ax.legend()
|
||||||
|
ax.xaxis.set_major_formatter(_bytes_fmt)
|
||||||
|
fig.tight_layout()
|
||||||
|
fig.savefig(out_dir / "overview.png", dpi=120)
|
||||||
|
plt.close(fig)
|
||||||
|
|
||||||
|
print(f"\nWrote {out_dir / 'overview.png'}")
|
||||||
|
|||||||
@@ -0,0 +1,194 @@
|
|||||||
|
"""ADR-0009 D5 invariant: all PEs targeted by a single kernel launch MUST
|
||||||
|
begin executing the kernel body at the same simulated time, regardless of
|
||||||
|
their dispatch path length.
|
||||||
|
|
||||||
|
These tests directly verify the invariant by capturing per-PE state at the
|
||||||
|
top of `_execute_kernel`:
|
||||||
|
|
||||||
|
test_no_pe_arrives_after_target_start_ns
|
||||||
|
Asserts: for every PE that enters _execute_kernel during a multi-cube
|
||||||
|
launch, `env.now` at entry must be <= target_start_ns. Otherwise the
|
||||||
|
PE's barrier yield would be a no-op and `pe_exec_start` would be set
|
||||||
|
late, breaking the D5 "same simulated time" mandate.
|
||||||
|
|
||||||
|
test_all_pes_have_identical_pe_exec_start
|
||||||
|
Asserts: every PE's `pe_exec_start` (the value of `env.now` recorded
|
||||||
|
immediately AFTER the barrier yield) is identical across all PEs in
|
||||||
|
the launch.
|
||||||
|
|
||||||
|
Both tests are expected to FAIL today; they serve as the regression check
|
||||||
|
that the Phase 2 D5 predictor + fallback fix must make pass.
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from kernbench.policy.placement.dp import DPPolicy
|
||||||
|
from kernbench.runtime_api.context import RuntimeContext
|
||||||
|
from kernbench.runtime_api.types import DeviceSelector
|
||||||
|
from kernbench.sim_engine.engine import GraphEngine
|
||||||
|
from kernbench.topology.builder import resolve_topology
|
||||||
|
|
||||||
|
TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
|
||||||
|
|
||||||
|
|
||||||
|
def _capture_per_pe_d5_state():
|
||||||
|
"""Monkey-patch PeCpuComponent._execute_kernel to record, per PE:
|
||||||
|
|
||||||
|
- entry_now: env.now at function entry (before any yield)
|
||||||
|
- target_start_ns: the value carried by the request
|
||||||
|
- barrier_yielded: True if the barrier yield fired (entry_now < target)
|
||||||
|
- pe_exec_start: env.now immediately after the barrier check
|
||||||
|
(i.e. the value the original code sets)
|
||||||
|
|
||||||
|
Returns (records: list[dict], restore: callable).
|
||||||
|
"""
|
||||||
|
import kernbench.components.builtin.pe_cpu as pe_cpu_mod
|
||||||
|
|
||||||
|
records: list[dict] = []
|
||||||
|
original = pe_cpu_mod.PeCpuComponent._execute_kernel
|
||||||
|
|
||||||
|
def patched(self, env, txn):
|
||||||
|
request = txn.request
|
||||||
|
target_start = getattr(request, "target_start_ns", None)
|
||||||
|
entry_now = float(env.now)
|
||||||
|
rec = {
|
||||||
|
"node_id": self.node.id,
|
||||||
|
"entry_now": entry_now,
|
||||||
|
"target_start_ns": (
|
||||||
|
float(target_start) if target_start is not None else None
|
||||||
|
),
|
||||||
|
"barrier_yielded": (
|
||||||
|
target_start is not None
|
||||||
|
and float(target_start) > entry_now
|
||||||
|
),
|
||||||
|
"pe_exec_start": None, # filled below by sniff
|
||||||
|
"late_ns": (
|
||||||
|
None if target_start is None
|
||||||
|
else max(0.0, entry_now - float(target_start))
|
||||||
|
),
|
||||||
|
}
|
||||||
|
records.append(rec)
|
||||||
|
|
||||||
|
# We can't easily inject a callback at the original's
|
||||||
|
# `pe_exec_start = env.now` line without rewriting it. Approximate:
|
||||||
|
# if the original yields the barrier, env.now after the yield is
|
||||||
|
# target_start_ns; otherwise pe_exec_start is entry_now (skipped).
|
||||||
|
if rec["barrier_yielded"]:
|
||||||
|
rec["pe_exec_start"] = float(target_start)
|
||||||
|
else:
|
||||||
|
rec["pe_exec_start"] = entry_now
|
||||||
|
|
||||||
|
yield from original(self, env, txn)
|
||||||
|
|
||||||
|
pe_cpu_mod.PeCpuComponent._execute_kernel = patched
|
||||||
|
|
||||||
|
def restore():
|
||||||
|
pe_cpu_mod.PeCpuComponent._execute_kernel = original
|
||||||
|
|
||||||
|
return records, restore
|
||||||
|
|
||||||
|
|
||||||
|
def _run_multicube_launch():
|
||||||
|
"""Drive a no-op kernel launch across all 16 cubes x 8 PEs and return
|
||||||
|
the per-PE D5 records collected by the monkey-patch."""
|
||||||
|
records, restore = _capture_per_pe_d5_state()
|
||||||
|
try:
|
||||||
|
topo = resolve_topology(str(TOPOLOGY_PATH))
|
||||||
|
engine = GraphEngine(topo.topology_obj, enable_data=True)
|
||||||
|
spec = topo.topology_obj.spec
|
||||||
|
|
||||||
|
with RuntimeContext(
|
||||||
|
engine=engine, target_device=DeviceSelector("all"),
|
||||||
|
correlation_id="d5_barrier", spec=spec,
|
||||||
|
) as ctx:
|
||||||
|
dp = DPPolicy(
|
||||||
|
cube="row_wise", pe="column_wise",
|
||||||
|
num_cubes=16, num_pes=8,
|
||||||
|
)
|
||||||
|
|
||||||
|
def kernel(t_ptr, n_elem, tl):
|
||||||
|
pass # no-op
|
||||||
|
|
||||||
|
ctx.ahbm.set_device(0)
|
||||||
|
t = ctx.zeros(
|
||||||
|
(16, 8 * 64), dtype="f16", dp=dp, name="probe",
|
||||||
|
)
|
||||||
|
t.copy_(ctx.from_numpy(
|
||||||
|
np.zeros((16, 8 * 64), dtype=np.float16),
|
||||||
|
))
|
||||||
|
|
||||||
|
pending = ctx.launch(
|
||||||
|
"d5_probe", kernel, t, 64, _defer_wait=True,
|
||||||
|
)
|
||||||
|
for h, _sip, meta in pending:
|
||||||
|
ctx.wait(h, _meta=meta)
|
||||||
|
finally:
|
||||||
|
restore()
|
||||||
|
return records
|
||||||
|
|
||||||
|
|
||||||
|
def test_no_pe_arrives_after_target_start_ns():
|
||||||
|
"""ADR-0009 D5: no PE may enter `_execute_kernel` after target_start_ns.
|
||||||
|
|
||||||
|
Today this fails because IO_CPU's predictor under-shoots actual
|
||||||
|
dispatch latency for far cubes (cube4, cube9-15). Phase 2 fix:
|
||||||
|
chain-aware predictor in IO_CPU + monotonic upward re-stamp in M_CPU.
|
||||||
|
"""
|
||||||
|
records = _run_multicube_launch()
|
||||||
|
assert records, "expected per-PE _execute_kernel records"
|
||||||
|
|
||||||
|
late = [
|
||||||
|
r for r in records
|
||||||
|
if r["target_start_ns"] is not None
|
||||||
|
and r["late_ns"] is not None
|
||||||
|
and r["late_ns"] > 1e-6
|
||||||
|
]
|
||||||
|
|
||||||
|
if late:
|
||||||
|
# Provide actionable diagnostic in the failure.
|
||||||
|
worst = sorted(late, key=lambda r: -r["late_ns"])[:5]
|
||||||
|
details = "\n".join(
|
||||||
|
f" {r['node_id']}: late by {r['late_ns']:.2f} ns "
|
||||||
|
f"(entry_now={r['entry_now']:.2f}, "
|
||||||
|
f"target_start_ns={r['target_start_ns']:.2f})"
|
||||||
|
for r in worst
|
||||||
|
)
|
||||||
|
pytest.fail(
|
||||||
|
f"ADR-0009 D5 violated: {len(late)}/{len(records)} PEs "
|
||||||
|
f"entered _execute_kernel AFTER target_start_ns "
|
||||||
|
f"(barrier yield silently skipped). "
|
||||||
|
f"Worst offenders:\n{details}"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def test_all_pes_have_identical_pe_exec_start():
|
||||||
|
"""ADR-0009 D5: every PE's pe_exec_start must be identical.
|
||||||
|
|
||||||
|
With D5 honored, every PE either yields to target_start_ns (start =
|
||||||
|
target_start_ns) or, if late, would still be aligned by the M_CPU
|
||||||
|
upward re-stamp (Phase 2). Today: 75/128 PEs in this launch have
|
||||||
|
distinct pe_exec_start values because they skipped the barrier.
|
||||||
|
"""
|
||||||
|
records = _run_multicube_launch()
|
||||||
|
assert records, "expected per-PE _execute_kernel records"
|
||||||
|
|
||||||
|
starts = sorted({round(r["pe_exec_start"], 6) for r in records})
|
||||||
|
if len(starts) > 1:
|
||||||
|
spread = max(starts) - min(starts)
|
||||||
|
# Distribution of how many PEs at each distinct start time
|
||||||
|
from collections import Counter
|
||||||
|
bucket = Counter(round(r["pe_exec_start"], 6) for r in records)
|
||||||
|
details = "\n".join(
|
||||||
|
f" pe_exec_start={t}: {n} PEs"
|
||||||
|
for t, n in sorted(bucket.items())
|
||||||
|
)
|
||||||
|
pytest.fail(
|
||||||
|
f"ADR-0009 D5 violated: PEs have {len(starts)} distinct "
|
||||||
|
f"pe_exec_start values (spread = {spread:.2f} ns); "
|
||||||
|
f"D5 mandates a single common value. "
|
||||||
|
f"Distribution:\n{details}"
|
||||||
|
)
|
||||||
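Note on the barrier rule the two tests above exercise. The following is a minimal standalone sketch, not code from this change set: `classify_barrier` is a hypothetical helper, and its record keys only mirror the ones the monkey-patch appears to collect (`barrier_yielded`, `late_ns`, `pe_exec_start`). It assumes a PE yields only when it arrives strictly before `target_start_ns`.

```python
from __future__ import annotations


def classify_barrier(entry_now: float, target_start_ns: float | None) -> dict:
    """Classify one PE's arrival relative to the D5 launch barrier (hypothetical helper)."""
    # A PE can only wait for the barrier if it arrives strictly before it.
    yielded = target_start_ns is not None and entry_now < target_start_ns
    # Lateness is clamped at zero so early arrivals never report negative values.
    late_ns = (
        None if target_start_ns is None
        else max(0.0, entry_now - target_start_ns)
    )
    return {
        "barrier_yielded": yielded,
        "late_ns": late_ns,
        # Yielding PEs start exactly at the barrier; all others start on arrival.
        "pe_exec_start": float(target_start_ns) if yielded else entry_now,
    }


# Made-up arrival times, for illustration only.
early = classify_barrier(entry_now=40.0, target_start_ns=100.0)
late = classify_barrier(entry_now=112.5, target_start_ns=100.0)
unstamped = classify_barrier(entry_now=7.0, target_start_ns=None)
```

Under this reading, a PE that arrives late keeps its own arrival time as `pe_exec_start`, which is why a single late PE is enough to break the "single common value" check.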
@@ -0,0 +1,62 @@
|
|||||||
|
"""ADR-0009 D5: synchronized launch barrier.
|
||||||
|
|
||||||
|
IO_CPU stamps KernelLaunchMsg with target_start_ns = env.now + max path
|
||||||
|
latency; PE_CPU yields until that time before recording pe_exec_start.
|
||||||
|
Every PE in a single launch MUST begin kernel execution at the same
|
||||||
|
env.now regardless of its dispatch path length.
|
||||||
|
|
||||||
|
We verify this indirectly: for a no-op kernel, pe_exec_ns = env.now -
|
||||||
|
pe_exec_start. If every PE's pe_exec_start is identical and every PE
|
||||||
|
runs the same no-op body, every pe_exec_ns value must be identical.
|
||||||
|
Without D5, pe_exec_start varies by dispatch-path length and so does
|
||||||
|
pe_exec_ns.
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
from kernbench.policy.placement.dp import DPPolicy
|
||||||
|
from kernbench.runtime_api.context import RuntimeContext
|
||||||
|
from kernbench.runtime_api.types import DeviceSelector
|
||||||
|
from kernbench.sim_engine.engine import GraphEngine
|
||||||
|
from kernbench.topology.builder import resolve_topology
|
||||||
|
|
||||||
|
TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
|
||||||
|
|
||||||
|
|
||||||
|
def test_kernel_launch_sync_all_pes_have_equal_exec_time():
|
||||||
|
"""No-op kernel: every PE's pe_exec_ns must be identical under D5."""
|
||||||
|
topo = resolve_topology(str(TOPOLOGY_PATH))
|
||||||
|
engine = GraphEngine(topo.topology_obj, enable_data=True)
|
||||||
|
spec = topo.topology_obj.spec
|
||||||
|
|
||||||
|
with RuntimeContext(engine=engine, target_device=DeviceSelector("all"),
|
||||||
|
correlation_id="sync_test", spec=spec) as ctx:
|
||||||
|
dp = DPPolicy(cube="row_wise", pe="column_wise",
|
||||||
|
num_cubes=16, num_pes=8)
|
||||||
|
|
||||||
|
def kernel(t_ptr, n_elem, tl):
|
||||||
|
pass # no-op
|
||||||
|
|
||||||
|
ctx.ahbm.set_device(0)
|
||||||
|
t = ctx.zeros((16, 8 * 64), dtype="f16", dp=dp, name="probe")
|
||||||
|
t.copy_(ctx.from_numpy(np.zeros((16, 8 * 64), dtype=np.float16)))
|
||||||
|
|
||||||
|
pending = ctx.launch("sync_probe", kernel, t, 64, _defer_wait=True)
|
||||||
|
for h, _sip, meta in pending:
|
||||||
|
ctx.wait(h, _meta=meta)
|
||||||
|
|
||||||
|
pe_exec_vals = []
|
||||||
|
for h, _sip, _meta in pending:
|
||||||
|
_, trace = engine.get_completion(h)
|
||||||
|
if trace and trace.get("pe_exec_ns") is not None:
|
||||||
|
pe_exec_vals.append(float(trace["pe_exec_ns"]))
|
||||||
|
|
||||||
|
assert pe_exec_vals, "expected completion traces with pe_exec_ns"
|
||||||
|
spread = max(pe_exec_vals) - min(pe_exec_vals)
|
||||||
|
assert spread < 1e-6, (
|
||||||
|
f"ADR-0009 D5 violated: pe_exec_ns spread across PEs = "
|
||||||
|
f"{spread:.6f} ns (expected 0). Values: {pe_exec_vals}"
|
||||||
|
)
|
||||||
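Note on why the spread is expected to be zero. The sketch below is illustrative only and is not part of this change set: `stamp_target_start` and `exec_starts` are hypothetical helpers, and the latency numbers are made up. It assumes each PE starts at the later of its arrival time and the stamped `target_start_ns`.

```python
def stamp_target_start(now_ns: float, dispatch_latency_ns: dict) -> float:
    """Stamp the barrier at the slowest predicted dispatch path (hypothetical helper)."""
    return now_ns + max(dispatch_latency_ns.values())


def exec_starts(now_ns: float, dispatch_latency_ns: dict, target_start_ns: float) -> dict:
    """Per-PE start time: wait for the barrier if early, start on arrival if late."""
    return {
        pe: max(now_ns + latency, target_start_ns)
        for pe, latency in dispatch_latency_ns.items()
    }


# Made-up dispatch latencies (ns) for three PEs at different distances.
latencies = {"cube0.pe0": 12.0, "cube4.pe3": 47.5, "cube15.pe7": 80.0}

# Accurate predictor: the barrier covers the slowest path, so all starts coincide.
target = stamp_target_start(0.0, latencies)
starts = exec_starts(0.0, latencies, target)
spread = max(starts.values()) - min(starts.values())

# Undershooting predictor: the far PE arrives after the barrier and starts late.
under = exec_starts(0.0, latencies, target_start_ns=50.0)
under_spread = max(under.values()) - min(under.values())
```

With an accurate stamp the spread is zero. With an undershooting stamp the far PE alone determines the spread, which matches the failure mode described for far cubes.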
@@ -0,0 +1,741 @@
|
|||||||
|
"""Diagnostic for the inter-cube RAW > IPCQ asymmetry on h3/h4 plots.
|
||||||
|
|
||||||
|
Single-shot run at h3 (sip0.cube0.pe0 -> sip0.cube1.pe0), nbytes=4096.
|
||||||
|
|
||||||
|
Captures per-PE pe_exec_ns and the actual path / drain / per-node overhead
|
||||||
|
breakdown for the RAW sub-txn (PE_DMA -> remote HBM_CTRL) vs the IPCQ
|
||||||
|
outbound sub-txn (PE_DMA -> peer PE_DMA), so we can localize the gap to
|
||||||
|
one of:
|
||||||
|
(a) drain at HBM-BW (RAW) vs fabric-BW (IPCQ)
|
||||||
|
(b) path-length / per-node overhead asymmetry
|
||||||
|
(c) RAW SRC paying tl.load (local HBM read) on top of remote tl.store
|
||||||
|
while IPCQ DST only pays inbound traversal+drain.
|
||||||
|
|
||||||
|
Phase 1 / test-only. No production code is modified.
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from kernbench.ccl.install import load_ccl_config, resolve_algorithm_config
|
||||||
|
from kernbench.ccl.sfr_config import configure_sfr_intercube_multisip
|
||||||
|
from kernbench.policy.placement.dp import DPPolicy
|
||||||
|
from kernbench.runtime_api.context import RuntimeContext
|
||||||
|
from kernbench.runtime_api.types import DeviceSelector
|
||||||
|
from kernbench.sim_engine.engine import GraphEngine
|
||||||
|
from kernbench.topology.builder import resolve_topology
|
||||||
|
|
||||||
|
TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
|
||||||
|
|
||||||
|
import os
|
||||||
|
|
||||||
|
# Allow the test to be re-run for h4 (inter-cube vertical) at multiple sizes
|
||||||
|
# to investigate why IPCQ slope flattens past 8192 B (path may differ).
|
||||||
|
NBYTES = int(os.environ.get("DIAG_NBYTES", "4096"))
|
||||||
|
ELEM_BYTES = 2
|
||||||
|
N_ELEM = NBYTES // ELEM_BYTES
|
||||||
|
N_CUBES = 16
|
||||||
|
N_PES = 8
|
||||||
|
HOP = os.environ.get("DIAG_HOP", "h3")
|
||||||
|
if HOP == "h4":
|
||||||
|
SRC = (0, 0, 0)
|
||||||
|
DST = (0, 4, 0) # h4 inter-cube vertical
|
||||||
|
else:
|
||||||
|
SRC = (0, 0, 0)
|
||||||
|
DST = (0, 1, 0) # h3 inter-cube horizontal
|
||||||
|
|
||||||
|
|
||||||
|
# ── Per-PE pe_exec_ns capture via monkey-patch ───────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
def _install_barrier_capture():
|
||||||
|
"""Wrap PeCpuComponent._execute_kernel to log, for every PE that
|
||||||
|
enters: env.now at entry, target_start_ns the request carried,
|
||||||
|
whether the barrier was skipped (i.e. target_start_ns is None or
|
||||||
|
env.now >= target_start_ns), and how late the entry was (delta_late_ns).
|
||||||
|
"""
|
||||||
|
import kernbench.components.builtin.pe_cpu as pe_cpu_mod
|
||||||
|
|
||||||
|
log: list[dict] = []
|
||||||
|
original = pe_cpu_mod.PeCpuComponent._execute_kernel
|
||||||
|
|
||||||
|
def patched(self, env, txn):
|
||||||
|
request = txn.request
|
||||||
|
target_start = getattr(request, "target_start_ns", None)
|
||||||
|
entry_now = float(env.now)
|
||||||
|
log_entry = {
|
||||||
|
"node_id": self.node.id,
|
||||||
|
"entry_now": entry_now,
|
||||||
|
"target_start_ns": (
|
||||||
|
float(target_start) if target_start is not None else None
|
||||||
|
),
|
||||||
|
"barrier_skipped": (
|
||||||
|
target_start is None
|
||||||
|
or float(target_start) <= entry_now
|
||||||
|
),
|
||||||
|
"delta_late_ns": (
|
||||||
|
None if target_start is None
|
||||||
|
else max(0.0, entry_now - float(target_start))
|
||||||
|
),
|
||||||
|
}
|
||||||
|
log.append(log_entry)
|
||||||
|
yield from original(self, env, txn)
|
||||||
|
|
||||||
|
pe_cpu_mod.PeCpuComponent._execute_kernel = patched
|
||||||
|
|
||||||
|
def restore():
|
||||||
|
pe_cpu_mod.PeCpuComponent._execute_kernel = original
|
||||||
|
|
||||||
|
return log, restore
|
||||||
|
|
||||||
|
|
||||||
|
def _install_per_pe_capture():
|
||||||
|
"""Wrap PeCpuComponent._execute_kernel so we record (node_id ->
|
||||||
|
pe_exec_ns) for every PE that executes a kernel during the run.
|
||||||
|
|
||||||
|
Returns (capture_dict, restore_callable).
|
||||||
|
"""
|
||||||
|
import kernbench.components.builtin.pe_cpu as pe_cpu_mod
|
||||||
|
|
||||||
|
captured: dict[str, float] = {}
|
||||||
|
original = pe_cpu_mod.PeCpuComponent._execute_kernel
|
||||||
|
|
||||||
|
def patched(self, env, txn):
|
||||||
|
gen = original(self, env, txn)
|
||||||
|
try:
|
||||||
|
value = yield from gen
|
||||||
|
finally:
|
||||||
|
v = txn.result_data.get("pe_exec_ns")
|
||||||
|
if v is not None:
|
||||||
|
captured[self.node.id] = float(v)
|
||||||
|
return value
|
||||||
|
|
||||||
|
pe_cpu_mod.PeCpuComponent._execute_kernel = patched
|
||||||
|
|
||||||
|
def restore():
|
||||||
|
pe_cpu_mod.PeCpuComponent._execute_kernel = original
|
||||||
|
|
||||||
|
return captured, restore
|
||||||
|
|
||||||
|
|
||||||
|
def _install_recv_capture(target_node_id: str):
|
||||||
|
"""Wrap PeIpcqComponent._handle_recv to log entry/exit times and the
|
||||||
|
peer_head_cache/my_tail values seen at the start.
|
||||||
|
|
||||||
|
This pins down whether recv ever blocked on a wait_event, or whether
|
||||||
|
it consumed without waiting (i.e. peer_head_cache > my_tail at entry).
|
||||||
|
"""
|
||||||
|
import kernbench.components.builtin.pe_ipcq as pe_ipcq_mod
|
||||||
|
|
||||||
|
log: list[dict] = []
|
||||||
|
original = pe_ipcq_mod.PeIpcqComponent._handle_recv
|
||||||
|
|
||||||
|
def patched(self, env, req, cmd):
|
||||||
|
if self.node.id != target_node_id:
|
||||||
|
yield from original(self, env, req, cmd)
|
||||||
|
return
|
||||||
|
# Snapshot state before dispatch
|
||||||
|
d = cmd.direction
|
||||||
|
qp = self._queue_pairs.get(d, {})
|
||||||
|
log.append({
|
||||||
|
"phase": "enter",
|
||||||
|
"t": float(env.now),
|
||||||
|
"direction": d,
|
||||||
|
"peer_head_cache": qp.get("peer_head_cache"),
|
||||||
|
"my_tail": qp.get("my_tail"),
|
||||||
|
})
|
||||||
|
yield from original(self, env, req, cmd)
|
||||||
|
qp = self._queue_pairs.get(d, {})
|
||||||
|
log.append({
|
||||||
|
"phase": "exit",
|
||||||
|
"t": float(env.now),
|
||||||
|
"direction": d,
|
||||||
|
"peer_head_cache": qp.get("peer_head_cache"),
|
||||||
|
"my_tail": qp.get("my_tail"),
|
||||||
|
})
|
||||||
|
|
||||||
|
pe_ipcq_mod.PeIpcqComponent._handle_recv = patched
|
||||||
|
|
||||||
|
def restore():
|
||||||
|
pe_ipcq_mod.PeIpcqComponent._handle_recv = original
|
||||||
|
|
||||||
|
return log, restore
|
||||||
|
|
||||||
|
|
||||||
|
def _install_meta_arrival_capture(target_node_id: str):
|
||||||
|
"""Log every IpcqMetaArrival that lands on ``target_node_id`` PE_IPCQ.
|
||||||
|
|
||||||
|
Records (env_now, sender_seq, dst_addr, matched_direction,
|
||||||
|
peer_head_cache_before, my_tail_before).
|
||||||
|
"""
|
||||||
|
import kernbench.components.builtin.pe_ipcq as pe_ipcq_mod
|
||||||
|
|
||||||
|
log: list[dict] = []
|
||||||
|
original = pe_ipcq_mod.PeIpcqComponent._handle_meta_arrival
|
||||||
|
|
||||||
|
def patched(self, msg):
|
||||||
|
if self.node.id == target_node_id:
|
||||||
|
token = msg.token
|
||||||
|
now = float(self._env.now) if hasattr(self, "_env") else 0.0
|
||||||
|
# The component may not store _env itself; fall back to the
|
||||||
|
# environment reference that the SimPy inbox keeps (self._inbox._env).
|
||||||
|
try:
|
||||||
|
now = float(self._inbox._env.now)
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
entry = {
|
||||||
|
"t": now,
|
||||||
|
"sender_seq": getattr(token, "sender_seq", None),
|
||||||
|
"dst_addr": getattr(token, "dst_addr", None),
|
||||||
|
"src_sip": getattr(token, "src_sip", None),
|
||||||
|
"src_cube": getattr(token, "src_cube", None),
|
||||||
|
"src_pe": getattr(token, "src_pe", None),
|
||||||
|
"src_direction": getattr(token, "src_direction", None),
|
||||||
|
"nbytes": getattr(token, "nbytes", None),
|
||||||
|
"matched_direction": None,
|
||||||
|
"peer_head_cache_before": {},
|
||||||
|
"my_tail_before": {},
|
||||||
|
}
|
||||||
|
for d, qp in self._queue_pairs.items():
|
||||||
|
entry["peer_head_cache_before"][d] = qp["peer_head_cache"]
|
||||||
|
entry["my_tail_before"][d] = qp["my_tail"]
|
||||||
|
base = qp["my_rx_base_pa"]
|
||||||
|
size = qp["n_slots"] * qp["slot_size"]
|
||||||
|
if entry["dst_addr"] is not None and base <= entry["dst_addr"] < base + size:
|
||||||
|
entry["matched_direction"] = d
|
||||||
|
log.append(entry)
|
||||||
|
return original(self, msg)
|
||||||
|
|
||||||
|
pe_ipcq_mod.PeIpcqComponent._handle_meta_arrival = patched
|
||||||
|
|
||||||
|
def restore():
|
||||||
|
pe_ipcq_mod.PeIpcqComponent._handle_meta_arrival = original
|
||||||
|
|
||||||
|
return log, restore
|
||||||
|
|
||||||
|
|
||||||
|
def _snapshot_qp_state(engine, target_node_id: str) -> dict:
|
||||||
|
"""Snapshot every direction's qp state on the target PE_IPCQ now.
|
||||||
|
|
||||||
|
Captures peer_head_cache, my_tail, my_rx_base_pa, n_slots, slot_size
|
||||||
|
for each installed direction.
|
||||||
|
"""
|
||||||
|
comp = engine._components.get(target_node_id)
|
||||||
|
if comp is None:
|
||||||
|
return {}
|
||||||
|
return {
|
||||||
|
d: {
|
||||||
|
"peer_head_cache": qp["peer_head_cache"],
|
||||||
|
"my_tail": qp["my_tail"],
|
||||||
|
"my_rx_base_pa": qp["my_rx_base_pa"],
|
||||||
|
"n_slots": qp["n_slots"],
|
||||||
|
"slot_size": qp["slot_size"],
|
||||||
|
"rx_range": (
|
||||||
|
qp["my_rx_base_pa"],
|
||||||
|
qp["my_rx_base_pa"] + qp["n_slots"] * qp["slot_size"],
|
||||||
|
),
|
||||||
|
}
|
||||||
|
for d, qp in comp._queue_pairs.items()
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
# ── Path / drain breakdown using engine ctx ──────────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
def _path_breakdown(ctx, path: list[str], nbytes: int) -> dict:
|
||||||
|
edge_total_ns = 0.0
|
||||||
|
edge_details = []
|
||||||
|
min_bw = float("inf")
|
||||||
|
for i in range(len(path) - 1):
|
||||||
|
edge = ctx.edge_map.get((path[i], path[i + 1]))
|
||||||
|
if edge is None:
|
||||||
|
edge_details.append((path[i], path[i + 1], None, None, None))
|
||||||
|
continue
|
||||||
|
prop_ns = edge.distance_mm * ctx.ns_per_mm
|
||||||
|
edge_total_ns += prop_ns
|
||||||
|
bw = getattr(edge, "bw_gbs", None) or 0.0
|
||||||
|
if bw > 0 and bw < min_bw:
|
||||||
|
min_bw = bw
|
||||||
|
edge_details.append(
|
||||||
|
(path[i], path[i + 1], edge.distance_mm, prop_ns, bw),
|
||||||
|
)
|
||||||
|
|
||||||
|
overhead_total_ns = 0.0
|
||||||
|
overhead_details = []
|
||||||
|
for nid in path:
|
||||||
|
oh = float(ctx.node_overhead_ns.get(nid, 0.0))
|
||||||
|
overhead_total_ns += oh
|
||||||
|
overhead_details.append((nid, oh))
|
||||||
|
|
||||||
|
drain_ns = ctx.compute_drain_ns(path, nbytes)
|
||||||
|
bottleneck_bw = None if min_bw == float("inf") else min_bw
|
||||||
|
|
||||||
|
return {
|
||||||
|
"path": path,
|
||||||
|
"edges": edge_details,
|
||||||
|
"edge_total_ns": edge_total_ns,
|
||||||
|
"overheads": overhead_details,
|
||||||
|
"overhead_total_ns": overhead_total_ns,
|
||||||
|
"drain_ns": drain_ns,
|
||||||
|
"bottleneck_bw_gbs": bottleneck_bw,
|
||||||
|
"expected_total_ns": edge_total_ns + overhead_total_ns + drain_ns,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _print_breakdown(label: str, br: dict) -> None:
|
||||||
|
print(f"\n {label}")
|
||||||
|
print(f" path ({len(br['path'])} nodes):")
|
||||||
|
for nid in br["path"]:
|
||||||
|
print(f" - {nid}")
|
||||||
|
print(f" edges (prop. delay):")
|
||||||
|
for src, dst, dist_mm, prop_ns, bw in br["edges"]:
|
||||||
|
if dist_mm is None:
|
||||||
|
print(f" ! {src} -> {dst} EDGE NOT FOUND IN edge_map")
|
||||||
|
continue
|
||||||
|
print(
|
||||||
|
f" {src} -> {dst} "
|
||||||
|
f"dist={dist_mm:.3f}mm prop={prop_ns:.2f}ns "
|
||||||
|
f"bw={bw or 0:.2f}GB/s"
|
||||||
|
)
|
||||||
|
print(f" per-node overhead_ns:")
|
||||||
|
for nid, oh in br["overheads"]:
|
||||||
|
if oh > 0:
|
||||||
|
print(f" {nid:<60s} overhead_ns={oh:.2f}")
|
||||||
|
print(f" edge_total_ns = {br['edge_total_ns']:.2f}")
|
||||||
|
print(f" overhead_total_ns = {br['overhead_total_ns']:.2f}")
|
||||||
|
print(f" bottleneck_bw_gbs = {br['bottleneck_bw_gbs']}")
|
||||||
|
print(f" drain_ns (nbytes={NBYTES}) = {br['drain_ns']:.2f}")
|
||||||
|
print(f" expected_total_ns = {br['expected_total_ns']:.2f}")
|
||||||
|
|
||||||
|
|
||||||
|
# ── RAW path scenario ────────────────────────────────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
def _dump_src_op_records(engine, src_sip, src_cube, src_pe, label) -> None:
|
||||||
|
"""Print op_logger records for ops on the SRC PE.
|
||||||
|
|
||||||
|
The op log captures t_start/t_end for memory/math/gemm/copy ops on
|
||||||
|
every component, so we can see how long tl.load vs tl.store vs
|
||||||
|
tl.send actually took at the engine level.
|
||||||
|
"""
|
||||||
|
op_logger = getattr(engine, "_op_logger", None)
|
||||||
|
if op_logger is None:
|
||||||
|
print(f" ({label}) op_logger not available")
|
||||||
|
return
|
||||||
|
src_prefix = f"sip{src_sip}.cube{src_cube}.pe{src_pe}."
|
||||||
|
recs = [r for r in op_logger.records if r.component_id.startswith(src_prefix)]
|
||||||
|
print(f" ({label}) op_logger records on SRC PE ({src_prefix}*):")
|
||||||
|
for r in recs[:40]:
|
||||||
|
dur = r.t_end - r.t_start
|
||||||
|
comp_short = r.component_id.replace(src_prefix, "")
|
||||||
|
params_short = ""
|
||||||
|
if "nbytes" in r.params:
|
||||||
|
params_short = f" nbytes={r.params['nbytes']}"
|
||||||
|
if "src_addr" in r.params:
|
||||||
|
params_short += f" src_addr={r.params['src_addr']}"
|
||||||
|
if "dst_addr" in r.params:
|
||||||
|
params_short += f" dst_addr={r.params['dst_addr']}"
|
||||||
|
print(
|
||||||
|
f" t=[{r.t_start:7.2f}..{r.t_end:7.2f}] dur={dur:6.2f}ns "
|
||||||
|
f"{comp_short:<25s} {r.op_kind:<8s} {r.op_name:<12s}{params_short}"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _run_raw():
|
||||||
|
captured, restore = _install_per_pe_capture()
|
||||||
|
try:
|
||||||
|
topo = resolve_topology(str(TOPOLOGY_PATH))
|
||||||
|
engine = GraphEngine(topo.topology_obj, enable_data=True)
|
||||||
|
spec = topo.topology_obj.spec
|
||||||
|
|
||||||
|
src_sip, src_cube, src_pe = SRC
|
||||||
|
dst_sip, dst_cube, dst_pe = DST
|
||||||
|
assert src_sip == dst_sip
|
||||||
|
|
||||||
|
src_off = (src_cube * N_PES + src_pe) * N_ELEM * ELEM_BYTES
|
||||||
|
dst_off = (dst_cube * N_PES + dst_pe) * N_ELEM * ELEM_BYTES
|
||||||
|
|
||||||
|
with RuntimeContext(
|
||||||
|
engine=engine,
|
||||||
|
target_device=DeviceSelector("all"),
|
||||||
|
correlation_id="diag_raw",
|
||||||
|
spec=spec,
|
||||||
|
) as rt:
|
||||||
|
dp = DPPolicy(
|
||||||
|
cube="row_wise", pe="column_wise",
|
||||||
|
num_cubes=N_CUBES, num_pes=N_PES,
|
||||||
|
)
|
||||||
|
rt.ahbm.set_device(src_sip)
|
||||||
|
t = rt.zeros(
|
||||||
|
(N_CUBES, N_PES * N_ELEM), dtype="f16",
|
||||||
|
dp=dp, name="raw_tensor",
|
||||||
|
)
|
||||||
|
t.copy_(rt.from_numpy(
|
||||||
|
np.full((N_CUBES, N_PES * N_ELEM), 1.0, dtype=np.float16),
|
||||||
|
))
|
||||||
|
|
||||||
|
def kernel(t_ptr, n_elem, tl):
|
||||||
|
pe_id = tl.program_id(axis=0)
|
||||||
|
cube_id = tl.program_id(axis=1)
|
||||||
|
if cube_id == src_cube and pe_id == src_pe:
|
||||||
|
data = tl.load(
|
||||||
|
t_ptr + src_off, shape=(n_elem,), dtype="f16",
|
||||||
|
)
|
||||||
|
tl.store(t_ptr + dst_off, data)
|
||||||
|
|
||||||
|
pending = rt.launch(
|
||||||
|
"diag_raw_kernel", kernel, t, N_ELEM, _defer_wait=True,
|
||||||
|
)
|
||||||
|
for h, _sip, meta in pending:
|
||||||
|
rt.wait(h, _meta=meta)
|
||||||
|
|
||||||
|
# Compute the RAW sub-txn path: src PE_DMA -> dst HBM_CTRL
|
||||||
|
from kernbench.policy.address.phyaddr import PhysAddr
|
||||||
|
ctx = next(iter(engine._components.values())).ctx
|
||||||
|
src_pe_prefix = f"sip{src_sip}.cube{src_cube}.pe{src_pe}"
|
||||||
|
# Resolve dst PA to HBM controller node
|
||||||
|
# The raw store kernel issues DmaWriteCmd on dst VA; in the engine
|
||||||
|
# this is translated via PE_MMU. For diagnostic we approximate
|
||||||
|
# the destination as the dst cube's HBM controller for slice
|
||||||
|
# belonging to dst_pe.
|
||||||
|
# Use the resolver on a constructed PA matching the same memory
|
||||||
|
# slice the kernel writes to.
|
||||||
|
# The tensor is "row_wise" sharded across cubes, so each cube
|
||||||
|
# owns row[cube_id, :], with each PE owning a column slice.
|
||||||
|
# The actual dst PA depends on the AHBM allocator; we read it
|
||||||
|
# via the tensor's shard map.
|
||||||
|
shard_map = getattr(t, "_shard_map", None) or getattr(t, "shard_map", None)
|
||||||
|
# Fallback: query the resolver directly by constructing a PA in
|
||||||
|
# the dst cube's HBM region. If shard_map is unavailable, still
|
||||||
|
# show the breakdown for src-PE-DMA -> first reachable HBM_CTRL
|
||||||
|
# in dst cube.
|
||||||
|
dst_hbm_id = f"sip{dst_sip}.cube{dst_cube}.hbm_ctrl"
|
||||||
|
if dst_hbm_id not in engine._components:
|
||||||
|
# try alternate naming
|
||||||
|
for nid in engine._components.keys():
|
||||||
|
if (
|
||||||
|
nid.startswith(f"sip{dst_sip}.cube{dst_cube}.")
|
||||||
|
and "hbm" in nid
|
||||||
|
):
|
||||||
|
dst_hbm_id = nid
|
||||||
|
break
|
||||||
|
|
||||||
|
# find_path() appends ".pe_dma" to the source PE prefix automatically
|
||||||
|
try:
|
||||||
|
raw_path = ctx.router.find_path(src_pe_prefix, dst_hbm_id)
|
||||||
|
except Exception as e:
|
||||||
|
raw_path = []
|
||||||
|
print(f" WARN: find_path raw failed: {e}")
|
||||||
|
if not raw_path:
|
||||||
|
# Try other HBM-related node names in dst cube
|
||||||
|
for nid in engine._components.keys():
|
||||||
|
if not nid.startswith(f"sip{dst_sip}.cube{dst_cube}."):
|
||||||
|
continue
|
||||||
|
if "hbm" not in nid:
|
||||||
|
continue
|
||||||
|
try:
|
||||||
|
p = ctx.router.find_path(src_pe_prefix, nid)
|
||||||
|
except Exception:
|
||||||
|
p = []
|
||||||
|
if p:
|
||||||
|
raw_path = p
|
||||||
|
print(f" (fallback raw dst node: {nid})")
|
||||||
|
break
|
||||||
|
|
||||||
|
return captured, ctx, raw_path, engine
|
||||||
|
finally:
|
||||||
|
restore()
|
||||||
|
|
||||||
|
|
||||||
|
# ── IPCQ path scenario ───────────────────────────────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
def _run_ipcq():
|
||||||
|
captured, restore = _install_per_pe_capture()
|
||||||
|
dst_pe_ipcq_id = (
|
||||||
|
f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}.pe_ipcq"
|
||||||
|
)
|
||||||
|
arrival_log, restore_arrival = _install_meta_arrival_capture(
|
||||||
|
dst_pe_ipcq_id,
|
||||||
|
)
|
||||||
|
recv_log, restore_recv = _install_recv_capture(dst_pe_ipcq_id)
|
||||||
|
barrier_log, restore_barrier = _install_barrier_capture()
|
||||||
|
try:
|
||||||
|
topo = resolve_topology(str(TOPOLOGY_PATH))
|
||||||
|
engine = GraphEngine(topo.topology_obj, enable_data=True)
|
||||||
|
spec = topo.topology_obj.spec
|
||||||
|
|
||||||
|
src_sip, src_cube, src_pe = SRC
|
||||||
|
dst_sip, dst_cube, dst_pe = DST
|
||||||
|
|
||||||
|
cfg = load_ccl_config()
|
||||||
|
merged = resolve_algorithm_config(cfg, name="intercube_allreduce")
|
||||||
|
merged["slot_size"] = max(int(merged.get("slot_size", 4096)), NBYTES)
|
||||||
|
|
||||||
|
with RuntimeContext(
|
||||||
|
engine=engine,
|
||||||
|
target_device=DeviceSelector("all"),
|
||||||
|
correlation_id="diag_ipcq",
|
||||||
|
spec=spec,
|
||||||
|
) as rt:
|
||||||
|
configure_sfr_intercube_multisip(engine, spec, merged)
|
||||||
|
dp = DPPolicy(
|
||||||
|
cube="row_wise", pe="column_wise",
|
||||||
|
num_cubes=N_CUBES, num_pes=N_PES,
|
||||||
|
)
|
||||||
|
|
||||||
|
def kernel(t_ptr, n_elem, tl):
|
||||||
|
pe_id = tl.program_id(axis=0)
|
||||||
|
cube_id = tl.program_id(axis=1)
|
||||||
|
if cube_id == src_cube and pe_id == src_pe:
|
||||||
|
data = tl.load(t_ptr, shape=(n_elem,), dtype="f16")
|
||||||
|
tl.send(dir=("E" if HOP == "h3" else "S"), src=data)
|
||||||
|
elif cube_id == dst_cube and pe_id == dst_pe:
|
||||||
|
tl.recv(
|
||||||
|
dir=("W" if HOP == "h3" else "N"),
|
||||||
|
shape=(n_elem,), dtype="f16",
|
||||||
|
)
|
||||||
|
|
||||||
|
tensors = []
|
||||||
|
for s in sorted({src_sip, dst_sip}):
|
||||||
|
rt.ahbm.set_device(s)
|
||||||
|
t = rt.zeros(
|
||||||
|
(N_CUBES, N_PES * N_ELEM), dtype="f16",
|
||||||
|
dp=dp, name=f"sip{s}",
|
||||||
|
)
|
||||||
|
t.copy_(rt.from_numpy(
|
||||||
|
np.full((N_CUBES, N_PES * N_ELEM), 1.0, dtype=np.float16),
|
||||||
|
))
|
||||||
|
tensors.append(t)
|
||||||
|
|
||||||
|
all_pending = []
|
||||||
|
for tt in tensors:
|
||||||
|
pending = rt.launch(
|
||||||
|
"diag_ipcq_kernel", kernel, tt, N_ELEM, _defer_wait=True,
|
||||||
|
)
|
||||||
|
all_pending.extend(pending)
|
||||||
|
for h, _sip, meta in all_pending:
|
||||||
|
rt.wait(h, _meta=meta)
|
||||||
|
|
||||||
|
ctx = next(iter(engine._components.values())).ctx
|
||||||
|
src_pe_prefix = f"sip{src_sip}.cube{src_cube}.pe{src_pe}"
|
||||||
|
dst_pe_dma = f"sip{dst_sip}.cube{dst_cube}.pe{dst_pe}.pe_dma"
|
||||||
|
try:
|
||||||
|
ipcq_path = ctx.router.find_path(src_pe_prefix, dst_pe_dma)
|
||||||
|
except Exception as e:
|
||||||
|
ipcq_path = []
|
||||||
|
print(f" WARN: find_path ipcq failed: {e}")
|
||||||
|
# Snapshot DST PE_IPCQ qp state at end-of-run so we can see what
|
||||||
|
# peer_head_cache/my_tail looked like (and at which directions).
|
||||||
|
qp_state = _snapshot_qp_state(engine, dst_pe_ipcq_id)
|
||||||
|
return (captured, ctx, ipcq_path, engine,
|
||||||
|
arrival_log, qp_state, recv_log, barrier_log)
|
||||||
|
finally:
|
||||||
|
restore_barrier()
|
||||||
|
restore_recv()
|
||||||
|
restore_arrival()
|
||||||
|
restore()
|
||||||
|
|
||||||
|
|
||||||
|
# ── Test entry ───────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.diagnostic
|
||||||
|
def test_pe_to_pe_diagnostic_h3():
|
||||||
|
print("\n" + "=" * 78)
|
||||||
|
print(f" Diagnostic: {HOP} inter-cube {'vertical' if HOP == 'h4' else 'horizontal'}, nbytes={NBYTES}")
|
||||||
|
print(f" src={SRC} dst={DST}")
|
||||||
|
print("=" * 78)
|
||||||
|
|
||||||
|
# ── RAW scenario
|
||||||
|
print("\n[RAW] tl.load + tl.store (sender pays both legs)")
|
||||||
|
raw_per_pe, raw_ctx, raw_path, raw_engine = _run_raw()
|
||||||
|
print(f" per-PE pe_exec_ns ({len(raw_per_pe)} entries):")
|
||||||
|
src_id = f"sip{SRC[0]}.cube{SRC[1]}.pe{SRC[2]}.pe_cpu"
|
||||||
|
dst_id = f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}.pe_cpu"
|
||||||
|
for nid in (src_id, dst_id):
|
||||||
|
if nid in raw_per_pe:
|
||||||
|
print(f" {nid:<60s} {raw_per_pe[nid]:.2f} ns <-- key PE")
|
||||||
|
nonzero = {k: v for k, v in raw_per_pe.items() if v > 0.5}
|
||||||
|
if nonzero:
|
||||||
|
print(f" other PEs with pe_exec_ns > 0.5 ns:")
|
||||||
|
for nid, v in sorted(nonzero.items(), key=lambda kv: -kv[1])[:6]:
|
||||||
|
if nid not in (src_id, dst_id):
|
||||||
|
print(f" {nid:<60s} {v:.2f} ns")
|
||||||
|
print(f" max(pe_exec_ns) = "
|
||||||
|
f"{max(raw_per_pe.values()) if raw_per_pe else 0:.2f} ns")
|
||||||
|
|
||||||
|
if raw_path:
|
||||||
|
br = _path_breakdown(raw_ctx, raw_path, NBYTES)
|
||||||
|
_print_breakdown("RAW sub-txn path (src.pe_dma -> dst.hbm_ctrl)", br)
|
||||||
|
_dump_src_op_records(raw_engine, *SRC, "RAW")
|
||||||
|
|
||||||
|
# ── IPCQ scenario
|
||||||
|
print("\n[IPCQ] tl.send + tl.recv (recv pays inbound traversal+drain)")
|
||||||
|
(ipcq_per_pe, ipcq_ctx, ipcq_path, ipcq_engine,
|
||||||
|
arrival_log, qp_state, recv_log, barrier_log) = _run_ipcq()
|
||||||
|
print(f"\n [BARRIER LOG] {len(barrier_log)} _execute_kernel entries:")
|
||||||
|
src_id = f"sip{SRC[0]}.cube{SRC[1]}.pe{SRC[2]}.pe_cpu"
|
||||||
|
dst_id = f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}.pe_cpu"
|
||||||
|
n_skipped = 0
|
||||||
|
src_entry = None
|
||||||
|
dst_entry = None
|
||||||
|
for e in barrier_log:
|
||||||
|
if e["barrier_skipped"]:
|
||||||
|
n_skipped += 1
|
||||||
|
if e["node_id"] == src_id:
|
||||||
|
src_entry = e
|
||||||
|
if e["node_id"] == dst_id:
|
||||||
|
dst_entry = e
|
||||||
|
print(f" PEs entering _execute_kernel: {len(barrier_log)}")
|
||||||
|
print(f" PEs that SKIPPED barrier (no target_start or env.now >= target_start): {n_skipped}")
|
||||||
|
if src_entry:
|
||||||
|
print(
|
||||||
|
f" SRC pe ({src_id}): entry_now={src_entry['entry_now']:.2f} "
|
||||||
|
f"target_start={src_entry['target_start_ns']:.2f} "
|
||||||
|
f"skipped={src_entry['barrier_skipped']} "
|
||||||
|
f"late_ns={src_entry['delta_late_ns']:.2f}"
|
||||||
|
)
|
||||||
|
if dst_entry:
|
||||||
|
print(
|
||||||
|
f" DST pe ({dst_id}): entry_now={dst_entry['entry_now']:.2f} "
|
||||||
|
f"target_start={dst_entry['target_start_ns']:.2f} "
|
||||||
|
f"skipped={dst_entry['barrier_skipped']} "
|
||||||
|
f"late_ns={dst_entry['delta_late_ns']:.2f}"
|
||||||
|
)
|
||||||
|
# Top 5 latest arrivals
|
||||||
|
sorted_late = sorted(
|
||||||
|
[e for e in barrier_log if e["delta_late_ns"] is not None],
|
||||||
|
key=lambda e: -e["delta_late_ns"],
|
||||||
|
)[:5]
|
||||||
|
print(f" Top 5 latest PE arrivals (positive = barrier missed):")
|
||||||
|
for e in sorted_late:
|
||||||
|
if e["delta_late_ns"] > 0:
|
||||||
|
print(
|
||||||
|
f" {e['node_id']}: late by {e['delta_late_ns']:.2f} ns "
|
||||||
|
f"(entry={e['entry_now']:.2f}, target={e['target_start_ns']:.2f})"
|
||||||
|
)
|
||||||
|
print(f"\n [RECV LOG on dst pe_ipcq] {len(recv_log)} entries:")
|
||||||
|
for e in recv_log:
|
||||||
|
print(
|
||||||
|
f" {e['phase']:5s} t={e['t']:8.2f} ns "
|
||||||
|
f"dir={e['direction']} "
|
||||||
|
f"peer_head_cache={e['peer_head_cache']} "
|
||||||
|
f"my_tail={e['my_tail']}"
|
||||||
|
)
|
||||||
|
print(f"\n [META-ARRIVAL LOG on dst pe_ipcq] {len(arrival_log)} arrivals:")
|
||||||
|
for i, e in enumerate(arrival_log):
|
||||||
|
print(
|
||||||
|
f" #{i:2d} t={e['t']:8.2f} ns "
|
||||||
|
f"src=(sip{e['src_sip']},cube{e['src_cube']},pe{e['src_pe']}) "
|
||||||
|
f"dir={e['src_direction']} "
|
||||||
|
f"sender_seq={e['sender_seq']} "
|
||||||
|
f"matched_dir={e['matched_direction']} "
|
||||||
|
f"nbytes={e['nbytes']}"
|
||||||
|
)
|
||||||
|
for d, ph in e["peer_head_cache_before"].items():
|
||||||
|
mt = e["my_tail_before"][d]
|
||||||
|
if ph != 0 or mt != 0 or d == e["matched_direction"]:
|
||||||
|
print(
|
||||||
|
f" before: dir={d} peer_head_cache={ph} my_tail={mt}"
|
||||||
|
)
|
||||||
|
print(f"\n [QP STATE END-OF-RUN on dst pe_ipcq]:")
|
||||||
|
for d, st in qp_state.items():
|
||||||
|
print(
|
||||||
|
f" dir={d} peer_head_cache={st['peer_head_cache']} "
|
||||||
|
f"my_tail={st['my_tail']} rx_range=[{st['rx_range'][0]}..."
|
||||||
|
f"{st['rx_range'][1]}) n_slots={st['n_slots']} "
|
||||||
|
f"slot_size={st['slot_size']}"
|
||||||
|
)
|
||||||
|
print(f" per-PE pe_exec_ns ({len(ipcq_per_pe)} entries):")
|
||||||
|
for nid in (src_id, dst_id):
|
||||||
|
if nid in ipcq_per_pe:
|
||||||
|
print(f" {nid:<60s} {ipcq_per_pe[nid]:.2f} ns <-- key PE")
|
||||||
|
nonzero = {k: v for k, v in ipcq_per_pe.items() if v > 0.5}
|
||||||
|
if nonzero:
|
||||||
|
print(f" other PEs with pe_exec_ns > 0.5 ns:")
|
||||||
|
for nid, v in sorted(nonzero.items(), key=lambda kv: -kv[1])[:6]:
|
||||||
|
if nid not in (src_id, dst_id):
|
||||||
|
print(f" {nid:<60s} {v:.2f} ns")
|
||||||
|
print(f" max(pe_exec_ns) = "
|
||||||
|
f"{max(ipcq_per_pe.values()) if ipcq_per_pe else 0:.2f} ns")
|
||||||
|
|
||||||
|
if ipcq_path:
|
||||||
|
br = _path_breakdown(ipcq_ctx, ipcq_path, NBYTES)
|
||||||
|
_print_breakdown("IPCQ sub-txn path (src.pe_dma -> peer.pe_dma)", br)
|
||||||
|
_dump_src_op_records(ipcq_engine, *SRC, "IPCQ")
|
||||||
|
_dump_src_op_records(ipcq_engine, *DST, "IPCQ DST")
|
||||||
|
|
||||||
|
# ── Credit-return path analysis (where the missing IPCQ "ack" lives)
|
||||||
|
print("\n" + "-" * 78)
|
||||||
|
print("Credit-return path (current modeling)")
|
||||||
|
print("-" * 78)
|
||||||
|
src_pe_prefix = f"sip{SRC[0]}.cube{SRC[1]}.pe{SRC[2]}"
|
||||||
|
dst_pe_prefix = f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}"
|
||||||
|
# PE_IPCQ._credit_latency_ns calls
|
||||||
|
# ctx.router.find_path(self._pe_prefix, peer_pe_prefix)
|
||||||
|
# where the *destination* lacks the ".pe_dma" suffix. find_path()
|
||||||
|
# only auto-appends to the source, so this raises -> the except
|
||||||
|
# clause silently returns 0.0. Effectively credit latency = 0.
|
||||||
|
try:
|
||||||
|
ipcq_ctx.router.find_path(dst_pe_prefix, src_pe_prefix)
|
||||||
|
bug_caught = False
|
||||||
|
except Exception as e:
|
||||||
|
bug_caught = True
|
||||||
|
print(f" CONFIRMED BUG in _credit_latency_ns: dest lacks '.pe_dma' "
|
||||||
|
f"-> find_path raises -> caught exception -> returns 0.0")
|
||||||
|
print(f" Error: {e}")
|
||||||
|
# The intended credit path is recv -> sender (reverse data direction)
|
||||||
|
try:
|
||||||
|
credit_path = ipcq_ctx.router.find_path(
|
||||||
|
dst_pe_prefix, f"{src_pe_prefix}.pe_dma",
|
||||||
|
)
|
||||||
|
except Exception as e:
|
||||||
|
credit_path = []
|
||||||
|
print(f" WARN: corrected find_path credit failed: {e}")
|
||||||
|
if credit_path:
|
||||||
|
credit_size = 16 # PE_IPCQ default _credit_size_bytes
|
||||||
|
# Today's modeling: drain only, 16 bytes -> ~0.125 ns
|
||||||
|
cur = ipcq_ctx.compute_drain_ns(credit_path, credit_size)
|
||||||
|
# Proposed modeling: full path latency (edges + node overhead + drain)
|
||||||
|
proposed = ipcq_ctx.compute_path_latency_ns(credit_path, credit_size)
|
||||||
|
print(f" credit path nodes = {len(credit_path)} (recv -> sender)")
|
||||||
|
for nid in credit_path[:6]:
|
||||||
|
print(f" {nid}")
|
||||||
|
if len(credit_path) > 6:
|
||||||
|
print(f" ... {len(credit_path) - 6} more nodes")
|
||||||
|
br = _path_breakdown(ipcq_ctx, credit_path, credit_size)
|
||||||
|
print(f" edge_total_ns = {br['edge_total_ns']:.2f}")
|
||||||
|
print(f" overhead_total_ns = {br['overhead_total_ns']:.2f}")
|
||||||
|
print(f" drain_ns(16 bytes) = {br['drain_ns']:.2f}")
|
||||||
|
print(f" CURRENT _credit_latency_ns (drain only) = {cur:.3f} ns")
|
||||||
|
print(f" PROPOSED (compute_path_latency_ns) = {proposed:.2f} ns")
|
||||||
|
print(f" delta = {proposed - cur:+.2f} ns")
|
||||||
|
|
||||||
|
# ── Comparison summary
|
||||||
|
print("\n" + "-" * 78)
|
||||||
|
print("Summary")
|
||||||
|
print("-" * 78)
|
||||||
|
raw_max = max(raw_per_pe.values()) if raw_per_pe else 0.0
|
||||||
|
ipcq_max = max(ipcq_per_pe.values()) if ipcq_per_pe else 0.0
|
||||||
|
print(f" RAW max(pe_exec_ns) = {raw_max:.2f} ns")
|
||||||
|
print(f" IPCQ max(pe_exec_ns) (current) = {ipcq_max:.2f} ns")
|
||||||
|
print(f" delta (RAW - IPCQ current) = {raw_max - ipcq_max:+.2f} ns")
|
||||||
|
if credit_path:
|
||||||
|
ipcq_with_credit = ipcq_max + (proposed - cur)
|
||||||
|
print(
|
||||||
|
f" IPCQ projected w/ blocking credit + full path overhead "
|
||||||
|
f"= {ipcq_with_credit:.2f} ns"
|
||||||
|
)
|
||||||
|
print(
|
||||||
|
f" delta (RAW - IPCQ projected) = "
|
||||||
|
f"{raw_max - ipcq_with_credit:+.2f} ns "
|
||||||
|
f"(<= 0 means IPCQ >= RAW)"
|
||||||
|
)
|
||||||
|
|
||||||
|
# Observational test: the only assertions are sanity checks that both runs recorded data.
|
||||||
|
assert raw_per_pe, "no RAW pe_exec_ns recorded"
|
||||||
|
assert ipcq_per_pe, "no IPCQ pe_exec_ns recorded"
|
||||||
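The credit-return comparison above contrasts a drain-only credit cost with a full path latency (edges + node overhead + drain). A minimal sketch of that arithmetic follows. Everything in it is an assumption made for illustration: the helper names `drain_ns` and `path_latency_ns`, the bandwidth of 128 bytes/ns (chosen only because it reproduces the "16 bytes -> ~0.125 ns" figure quoted in the comment), and the per-edge and per-node numbers. The real `compute_drain_ns` and `compute_path_latency_ns` may be computed differently.

```python
from typing import Sequence


def drain_ns(nbytes: int, bottleneck_bytes_per_ns: float) -> float:
    """Serialization time of the payload on the slowest link (hypothetical model)."""
    return nbytes / bottleneck_bytes_per_ns


def path_latency_ns(
    edge_ns: Sequence[float],
    node_overhead_ns: Sequence[float],
    nbytes: int,
    bottleneck_bytes_per_ns: float,
) -> float:
    """Full path cost: edge propagation + node overhead + drain (hypothetical model)."""
    return (
        sum(edge_ns)
        + sum(node_overhead_ns)
        + drain_ns(nbytes, bottleneck_bytes_per_ns)
    )


# Illustrative numbers only; none of these come from the real topology.
CREDIT_SIZE_BYTES = 16        # matches the default quoted in the test
ASSUMED_BW = 128.0            # bytes/ns, assumed so that 16 B drains in 0.125 ns
ASSUMED_EDGES = [2.0, 5.0, 2.0]
ASSUMED_OVERHEADS = [1.0, 1.0]

current = drain_ns(CREDIT_SIZE_BYTES, ASSUMED_BW)
proposed = path_latency_ns(
    ASSUMED_EDGES, ASSUMED_OVERHEADS, CREDIT_SIZE_BYTES, ASSUMED_BW
)
delta = proposed - current

print(f"current (drain only) = {current:.3f} ns")
print(f"proposed (full path) = {proposed:.2f} ns")
print(f"delta                = {delta:+.2f} ns")
```

Under these assumed numbers the delta is exactly the edge and overhead sum, which is the quantity the summary adds to `ipcq_max` when projecting the blocking-credit case.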
@@ -0,0 +1,358 @@
|
|||||||
"""PE-to-PE latency sweep across hop types and data sizes.

Compares IPCQ send/recv vs raw-DMA (tl.load + tl.store) latency for five
hop types:

    H1  Intra-cube horizontal   pe0 → pe1
    H2  Intra-cube vertical     pe0 → pe4
    H3  Inter-cube horizontal   sip0.cube0.pe0 → sip0.cube1.pe0
    H4  Inter-cube vertical     sip0.cube0.pe0 → sip0.cube4.pe0
    H5  Inter-SIP               sip0.cube0.pe0 → sip1.cube0.pe0
        (IPCQ only; raw needs cross-SIP MMU)

Sizes: 128..10240 bytes. Emits PNGs with both lines plus a CSV.
"""
|
||||||
|
from __future__ import annotations

import csv
from dataclasses import dataclass
from pathlib import Path

import numpy as np
import pytest

from kernbench.ccl.install import load_ccl_config, resolve_algorithm_config
from kernbench.ccl.sfr_config import configure_sfr_intercube_multisip
from kernbench.policy.placement.dp import DPPolicy
from kernbench.runtime_api.context import RuntimeContext
from kernbench.runtime_api.types import DeviceSelector
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import resolve_topology

TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
PLOT_DIR = Path(__file__).parent / "pe2pe_latency_plots"

SIZES = [128, 256, 384, 512, 768, 1024, 2048, 4096, 8192, 10240]

N_CUBES = 16
N_PES = 8
ELEM_BYTES = 2  # f16
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass(frozen=True)
class Hop:
    id: str
    label: str
    src: tuple[int, int, int]
    dst: tuple[int, int, int]
    send_dir: str
    recv_dir: str
    supports_raw: bool  # False for cross-SIP (DPPolicy intra-device only)


HOPS = [
    Hop("h1_intra_horizontal", "Intra-cube horizontal (pe0 to pe1)",
        (0, 0, 0), (0, 0, 1), "intra_E", "intra_W", True),
    Hop("h2_intra_vertical", "Intra-cube vertical (pe0 to pe4)",
        (0, 0, 0), (0, 0, 4), "intra_S", "intra_N", True),
    Hop("h3_inter_cube_horizontal", "Inter-cube horizontal (cube0 to cube1)",
        (0, 0, 0), (0, 1, 0), "E", "W", True),
    Hop("h4_inter_cube_vertical", "Inter-cube vertical (cube0 to cube4)",
        (0, 0, 0), (0, 4, 0), "S", "N", True),
    Hop("h5_inter_sip", "Inter-SIP (sip0 to sip1, same cube/pe)",
        (0, 0, 0), (1, 0, 0), "global_E", "global_W", False),
]
|
||||||
|
|
||||||
|
|
||||||
|
def _make_engine():
    topo = resolve_topology(str(TOPOLOGY_PATH))
    engine = GraphEngine(topo.topology_obj, enable_data=True)
    return engine, topo.topology_obj.spec
|
||||||
|
|
||||||
|
|
||||||
|
# ── IPCQ path ────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
def _measure_ipcq(hop: Hop, nbytes: int) -> float:
|
||||||
|
engine, spec = _make_engine()
|
||||||
|
|
||||||
|
cfg = load_ccl_config()
|
||||||
|
merged = resolve_algorithm_config(cfg, name="intercube_allreduce")
|
||||||
|
merged["slot_size"] = max(int(merged.get("slot_size", 4096)), nbytes)
|
||||||
|
|
||||||
|
n_elem = nbytes // ELEM_BYTES
|
||||||
|
src_sip, src_cube, src_pe = hop.src
|
||||||
|
dst_sip, dst_cube, dst_pe = hop.dst
|
||||||
|
send_dir, recv_dir = hop.send_dir, hop.recv_dir
|
||||||
|
|
||||||
|
with RuntimeContext(
|
||||||
|
engine=engine,
|
||||||
|
target_device=DeviceSelector("all"),
|
||||||
|
correlation_id=f"ipcq_{hop.id}_{nbytes}",
|
||||||
|
spec=spec,
|
||||||
|
) as ctx:
|
||||||
|
configure_sfr_intercube_multisip(engine, spec, merged)
|
||||||
|
|
||||||
|
dp = DPPolicy(
|
||||||
|
cube="row_wise", pe="column_wise",
|
||||||
|
num_cubes=N_CUBES, num_pes=N_PES,
|
||||||
|
)
|
||||||
|
|
||||||
|
def kernel(t_ptr, n_elem, tl):
|
||||||
|
pe_id = tl.program_id(axis=0)
|
||||||
|
cube_id = tl.program_id(axis=1)
|
||||||
|
if cube_id == src_cube and pe_id == src_pe:
|
||||||
|
data = tl.load(t_ptr, shape=(n_elem,), dtype="f16")
|
||||||
|
tl.send(dir=send_dir, src=data)
|
||||||
|
elif cube_id == dst_cube and pe_id == dst_pe:
|
||||||
|
tl.recv(dir=recv_dir, shape=(n_elem,), dtype="f16")
|
||||||
|
|
||||||
|
tensors = []
|
||||||
|
for s in sorted({src_sip, dst_sip}):
|
||||||
|
ctx.ahbm.set_device(s)
|
||||||
|
t = ctx.zeros(
|
||||||
|
(N_CUBES, N_PES * n_elem), dtype="f16",
|
||||||
|
dp=dp, name=f"sip{s}",
|
||||||
|
)
|
||||||
|
t.copy_(ctx.from_numpy(
|
||||||
|
np.full((N_CUBES, N_PES * n_elem), 1.0, dtype=np.float16),
|
||||||
|
))
|
||||||
|
tensors.append(t)
|
||||||
|
|
||||||
|
all_pending = []
|
||||||
|
for t in tensors:
|
||||||
|
pending = ctx.launch(
|
||||||
|
f"{hop.id}_ipcq", kernel, t, n_elem, _defer_wait=True,
|
||||||
|
)
|
||||||
|
all_pending.extend(pending)
|
||||||
|
for h, sip_id, meta in all_pending:
|
||||||
|
ctx.wait(h, _meta=meta)
|
||||||
|
|
||||||
|
# Per-PE kernel execution time (excludes launch dispatch and
|
||||||
|
# response aggregation). IPCQ: DST blocks on tl.recv until the
|
||||||
|
# send arrives, so max across SIPs = DST's transfer time.
|
||||||
|
pe_exec_vals = []
|
||||||
|
for h, _sip, _meta in all_pending:
|
||||||
|
_, trace = engine.get_completion(h)
|
||||||
|
if trace and trace.get("pe_exec_ns") is not None:
|
||||||
|
pe_exec_vals.append(float(trace["pe_exec_ns"]))
|
||||||
|
|
||||||
|
return max(pe_exec_vals) if pe_exec_vals else 0.0
|
||||||
|
|
||||||
|
|
||||||
|
# ── Raw DMA path (intra-SIP only) ────────────────────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
def _measure_raw(hop: Hop, nbytes: int) -> float:
|
||||||
|
"""tl.load from source slice + tl.store to destination slice. The VA
|
||||||
|
mapping spans the cube mesh within one SIP (MmuMapMsg broadcasts to all
|
||||||
|
cubes of the SIP), so the store goes through the fabric to the
|
||||||
|
destination PE's HBM. No IPCQ protocol involved.
|
||||||
|
"""
|
||||||
|
if not hop.supports_raw:
|
||||||
|
raise RuntimeError(f"hop {hop.id} does not support raw path")
|
||||||
|
|
||||||
|
engine, spec = _make_engine()
|
||||||
|
|
||||||
|
n_elem = nbytes // ELEM_BYTES
|
||||||
|
src_sip, src_cube, src_pe = hop.src
|
||||||
|
dst_sip, dst_cube, dst_pe = hop.dst
|
||||||
|
assert src_sip == dst_sip
|
||||||
|
|
||||||
|
# Slice offsets in the (N_CUBES, N_PES * n_elem) tensor:
|
||||||
|
# row = cube, slice within row = pe * n_elem .. (pe+1)*n_elem
|
||||||
|
# Byte offsets from va_base:
|
||||||
|
src_off = (src_cube * N_PES + src_pe) * n_elem * ELEM_BYTES
|
||||||
|
dst_off = (dst_cube * N_PES + dst_pe) * n_elem * ELEM_BYTES
|
||||||
|
|
||||||
|
with RuntimeContext(
|
||||||
|
engine=engine,
|
||||||
|
target_device=DeviceSelector("all"),
|
||||||
|
correlation_id=f"raw_{hop.id}_{nbytes}",
|
||||||
|
spec=spec,
|
||||||
|
) as ctx:
|
||||||
|
dp = DPPolicy(
|
||||||
|
cube="row_wise", pe="column_wise",
|
||||||
|
num_cubes=N_CUBES, num_pes=N_PES,
|
||||||
|
)
|
||||||
|
ctx.ahbm.set_device(src_sip)
|
||||||
|
t = ctx.zeros(
|
||||||
|
(N_CUBES, N_PES * n_elem), dtype="f16",
|
||||||
|
dp=dp, name="raw_tensor",
|
||||||
|
)
|
||||||
|
t.copy_(ctx.from_numpy(
|
||||||
|
np.full((N_CUBES, N_PES * n_elem), 1.0, dtype=np.float16),
|
||||||
|
))
|
||||||
|
|
||||||
|
def kernel(t_ptr, n_elem, tl):
|
||||||
|
pe_id = tl.program_id(axis=0)
|
||||||
|
cube_id = tl.program_id(axis=1)
|
||||||
|
if cube_id == src_cube and pe_id == src_pe:
|
||||||
|
data = tl.load(
|
||||||
|
t_ptr + src_off, shape=(n_elem,), dtype="f16",
|
||||||
|
)
|
||||||
|
tl.store(t_ptr + dst_off, data)
|
||||||
|
|
||||||
|
pending = ctx.launch(
|
||||||
|
f"{hop.id}_raw", kernel, t, n_elem, _defer_wait=True,
|
||||||
|
)
|
||||||
|
for h, sip_id, meta in pending:
|
||||||
|
ctx.wait(h, _meta=meta)
|
||||||
|
|
||||||
|
# Per-PE kernel execution time. Raw: only SRC does real work
|
||||||
|
# (tl.load + tl.store, store is blocking), so max across all PEs
|
||||||
|
# = SRC's transfer time. Idle PEs contribute only overhead_ns.
|
||||||
|
pe_exec_vals = []
|
||||||
|
for h, _sip, _meta in pending:
|
||||||
|
_, trace = engine.get_completion(h)
|
||||||
|
if trace and trace.get("pe_exec_ns") is not None:
|
||||||
|
pe_exec_vals.append(float(trace["pe_exec_ns"]))
|
||||||
|
|
||||||
|
return max(pe_exec_vals) if pe_exec_vals else 0.0
|
||||||
|
|
||||||
|
|
||||||
|
# ── CSV + plotting ───────────────────────────────────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
def _write_csv(records, path: Path) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(
            f, fieldnames=["hop", "label", "size_bytes", "path", "total_ns"],
        )
        w.writeheader()
        for r in records:
            w.writerow(r)
|
||||||
|
|
||||||
|
|
||||||
|
def _plot_per_hop(records, hop: Hop, path: Path) -> None:
    import matplotlib.pyplot as plt

    ipcq = sorted(
        [r for r in records if r["hop"] == hop.id and r["path"] == "ipcq"],
        key=lambda r: r["size_bytes"],
    )
    raw = sorted(
        [r for r in records if r["hop"] == hop.id and r["path"] == "raw"],
        key=lambda r: r["size_bytes"],
    )

    fig, ax = plt.subplots(figsize=(8, 5))
    if ipcq:
        ax.plot(
            [r["size_bytes"] for r in ipcq],
            [r["total_ns"] for r in ipcq],
            marker="o", label="IPCQ (send/recv)", color="tab:blue",
        )
    if raw:
        ax.plot(
            [r["size_bytes"] for r in raw],
            [r["total_ns"] for r in raw],
            marker="s", label="Raw DMA (load+store)", color="tab:orange",
        )
    else:
        ax.text(
            0.98, 0.02, "(Raw DMA unavailable for cross-SIP)",
            transform=ax.transAxes, ha="right", va="bottom",
            fontsize=9, color="gray",
        )
    ax.set_xlabel("Data size (bytes)")
    ax.set_ylabel("Latency (ns)")
    ax.set_title(hop.label)
    ax.grid(True, alpha=0.3)
    ax.legend()
    fig.tight_layout()
    fig.savefig(path, dpi=120)
    plt.close(fig)
|
||||||
|
|
||||||
|
|
||||||
|
def _plot_overview(records, path: Path) -> None:
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 3, figsize=(16, 9))
    axes = axes.flatten()
    for i, hop in enumerate(HOPS):
        ax = axes[i]
        ipcq = sorted(
            [r for r in records if r["hop"] == hop.id and r["path"] == "ipcq"],
            key=lambda r: r["size_bytes"],
        )
        raw = sorted(
            [r for r in records if r["hop"] == hop.id and r["path"] == "raw"],
            key=lambda r: r["size_bytes"],
        )
        if ipcq:
            ax.plot(
                [r["size_bytes"] for r in ipcq],
                [r["total_ns"] for r in ipcq],
                marker="o", label="IPCQ", color="tab:blue",
            )
        if raw:
            ax.plot(
                [r["size_bytes"] for r in raw],
                [r["total_ns"] for r in raw],
                marker="s", label="Raw", color="tab:orange",
            )
        ax.set_title(hop.label, fontsize=10)
        ax.set_xlabel("bytes")
        ax.set_ylabel("ns")
        ax.grid(True, alpha=0.3)
        ax.legend(fontsize=8)
    # Hide the unused sixth panel of the 2x3 grid.
    for j in range(len(HOPS), len(axes)):
        axes[j].axis("off")
    fig.suptitle(
        "PE-to-PE latency: IPCQ vs raw DMA",
        fontsize=14,
    )
    fig.tight_layout()
    fig.savefig(path, dpi=120)
    plt.close(fig)
|
||||||
|
|
||||||
|
|
||||||
|
# ── Test entry ───────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
def test_pe_to_pe_latency_sweep():
    records: list[dict] = []

    for hop in HOPS:
        for size in SIZES:
            # IPCQ path
            ipcq_ns = _measure_ipcq(hop, size)
            records.append({
                "hop": hop.id, "label": hop.label,
                "size_bytes": size, "path": "ipcq",
                "total_ns": ipcq_ns,
            })

            raw_s = "n/a"
            if hop.supports_raw:
                raw_ns = _measure_raw(hop, size)
                records.append({
                    "hop": hop.id, "label": hop.label,
                    "size_bytes": size, "path": "raw",
                    "total_ns": raw_ns,
                })
                raw_s = f"{raw_ns:7.1f}ns"

            print(
                f"[{hop.id}] size={size:5d} "
                f"ipcq={ipcq_ns:7.1f}ns raw={raw_s}"
            )

    PLOT_DIR.mkdir(parents=True, exist_ok=True)
    _write_csv(records, PLOT_DIR / "summary.csv")
    for hop in HOPS:
        _plot_per_hop(records, hop, PLOT_DIR / f"{hop.id}.png")
    _plot_overview(records, PLOT_DIR / "overview.png")

    for hop in HOPS:
        rs = sorted(
            [r for r in records if r["hop"] == hop.id and r["path"] == "ipcq"],
            key=lambda r: r["size_bytes"],
        )
        for r in rs:
            assert r["total_ns"] > 0, f"{hop.id}: total_ns must be > 0"

    print(f"\n Plots + CSV written to {PLOT_DIR}")
|
||||||
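The byte offsets used by `_measure_raw` (`src_off`, `dst_off`) follow from treating the `(N_CUBES, N_PES * n_elem)` tensor as row-major and contiguous from `va_base`. The sketch below works that arithmetic through for the four raw-capable hops at the smallest sweep size. The helper names are hypothetical, and the contiguous row-major assumption is taken from the comment in the test rather than verified against how `DPPolicy` actually places slices.

```python
N_CUBES = 16
N_PES = 8
ELEM_BYTES = 2  # f16


def slice_byte_offset(cube: int, pe: int, n_elem: int) -> int:
    """Offset formula as written in the test: one n_elem slice per (cube, pe)."""
    return (cube * N_PES + pe) * n_elem * ELEM_BYTES


def row_major_byte_offset(cube: int, pe: int, n_elem: int) -> int:
    """Same offset derived from a row-major index: row = cube, col = pe * n_elem."""
    row_len = N_PES * n_elem
    return (cube * row_len + pe * n_elem) * ELEM_BYTES


nbytes = 128
n_elem = nbytes // ELEM_BYTES  # 64 f16 elements

# Destination (cube, pe) of each raw-capable hop, as listed in HOPS.
destinations = {
    "h1_intra_horizontal": (0, 1),
    "h2_intra_vertical": (0, 4),
    "h3_inter_cube_horizontal": (1, 0),
    "h4_inter_cube_vertical": (4, 0),
}
offsets = {
    hop_id: slice_byte_offset(cube, pe, n_elem)
    for hop_id, (cube, pe) in destinations.items()
}

for hop_id, off in offsets.items():
    print(f"{hop_id:<26s} dst_off = {off} bytes")
```

Consecutive slices are exactly `nbytes` apart, so for a given payload size the destination offset is just the linear slice index times `nbytes`.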
@@ -0,0 +1,106 @@
|
|||||||
"""Rectangular (non-square) SIP-level 2D topology support.

Phase 1 regression target: today the 2D builtin topology functions in
``kernbench.ccl.topologies`` (``mesh_2d``, ``torus_2d``,
``mesh_2d_no_wrap``) hardcode ``side = sqrt(world_size)`` and raise
``ValueError`` for any non-square ``world_size``. This blocks running
the allreduce sweep at n_sips=6 on torus/mesh layouts.

Phase 2 will extend these functions to accept optional ``w, h`` kwargs
so a 2×3 (or 3×2, etc.) layout works. Until then, every test below that
passes ``w``/``h`` is expected to FAIL.

Layout convention used here (matches the square case):
    rank = row * w + col    for 0 <= row < h, 0 <= col < w

For w=2, h=3, world_size=6 the layout is:

              col=0  col=1
    row=0:      0      1
    row=1:      2      3
    row=2:      4      5
"""
|
||||||
|
from __future__ import annotations

import pytest

from kernbench.ccl.topologies import (
    mesh_2d,
    mesh_2d_no_wrap,
    torus_2d,
)
|
||||||
|
|
||||||
|
|
||||||
|
# ── mesh_2d_no_wrap (no wrap-around) ──────────────────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
def test_mesh_2d_no_wrap_2x3_top_left():
    """rank 0 (top-left, no N, no W): only S and E."""
    nbrs = mesh_2d_no_wrap(rank=0, world_size=6, w=2, h=3)
    assert nbrs == {"S": 2, "E": 1}, nbrs
|
||||||
|
|
||||||
|
|
||||||
|
def test_mesh_2d_no_wrap_2x3_top_right():
    """rank 1 (top-right, no N, no E): only S and W."""
    nbrs = mesh_2d_no_wrap(rank=1, world_size=6, w=2, h=3)
    assert nbrs == {"S": 3, "W": 0}, nbrs
|
||||||
|
|
||||||
|
|
||||||
|
def test_mesh_2d_no_wrap_2x3_middle_left():
    """rank 2 (middle-left, no W): N, S, E."""
    nbrs = mesh_2d_no_wrap(rank=2, world_size=6, w=2, h=3)
    assert nbrs == {"N": 0, "S": 4, "E": 3}, nbrs
|
||||||
|
|
||||||
|
|
||||||
|
def test_mesh_2d_no_wrap_2x3_bottom_right():
    """rank 5 (bottom-right, no S, no E): only N and W."""
    nbrs = mesh_2d_no_wrap(rank=5, world_size=6, w=2, h=3)
    assert nbrs == {"N": 3, "W": 4}, nbrs
|
||||||
|
|
||||||
|
|
||||||
|
# ── torus_2d (wrap-around on all four edges) ─────────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
def test_torus_2d_2x3_top_left():
    """rank 0: N wraps to row 2 col 0 (rank 4); W wraps to col 1 (rank 1)."""
    nbrs = torus_2d(rank=0, world_size=6, w=2, h=3)
    assert nbrs == {"N": 4, "S": 2, "W": 1, "E": 1}, nbrs
|
||||||
|
|
||||||
|
|
||||||
|
def test_torus_2d_2x3_bottom_right():
    """rank 5: S wraps to row 0 (rank 1); E wraps to col 0 (rank 4)."""
    nbrs = torus_2d(rank=5, world_size=6, w=2, h=3)
    assert nbrs == {"N": 3, "S": 1, "W": 4, "E": 4}, nbrs
|
||||||
|
|
||||||
|
|
||||||
|
# ── mesh_2d alias for torus_2d ───────────────────────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
def test_mesh_2d_2x3_matches_torus_2d():
    """mesh_2d is currently a torus alias; behaviour must match torus_2d."""
    for rank in range(6):
        assert mesh_2d(rank=rank, world_size=6, w=2, h=3) == \
            torus_2d(rank=rank, world_size=6, w=2, h=3)
|
||||||
|
|
||||||
|
|
||||||
|
# ── Back-compat: square layouts still work without w/h kwargs ────────
|
||||||
|
|
||||||
|
|
||||||
|
def test_square_back_compat_mesh_2d_no_wrap():
    """Calling without w, h should still work for square world_size."""
    nbrs = mesh_2d_no_wrap(rank=0, world_size=4)
    assert nbrs == {"S": 2, "E": 1}, nbrs
|
||||||
|
|
||||||
|
|
||||||
|
def test_square_back_compat_torus_2d():
    nbrs = torus_2d(rank=0, world_size=4)
    assert nbrs == {"N": 2, "S": 2, "W": 1, "E": 1}, nbrs
|
||||||
|
|
||||||
|
|
||||||
|
# ── Validation: w*h must match world_size ────────────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
def test_rectangular_dims_must_match_world_size():
    """Phase 2 contract: explicit w, h must satisfy w*h == world_size."""
    with pytest.raises(ValueError):
        mesh_2d_no_wrap(rank=0, world_size=6, w=3, h=3)  # 9 != 6
|
||||||
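For reference, the neighbour maps these tests expect can be derived directly from the `rank = row * w + col` convention. The sketch below is a hypothetical stand-alone helper (`rect_neighbors` is not part of `kernbench.ccl.topologies`, and the eventual Phase 2 signature may differ). It only illustrates the index arithmetic, with `wrap=True` standing in for the torus behaviour and `wrap=False` for the no-wrap mesh.

```python
from typing import Dict


def rect_neighbors(rank: int, w: int, h: int, wrap: bool) -> Dict[str, int]:
    """Hypothetical helper: N/S/W/E neighbours on a w-by-h grid.

    Ranks are laid out row-major: rank = row * w + col.
    With wrap=False, directions that leave the grid are omitted.
    With wrap=True, they wrap around to the opposite edge.
    """
    if not 0 <= rank < w * h:
        raise ValueError(f"rank {rank} outside 0..{w * h - 1}")
    row, col = divmod(rank, w)
    steps = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}
    nbrs: Dict[str, int] = {}
    for name, (d_row, d_col) in steps.items():
        r, c = row + d_row, col + d_col
        if wrap:
            r, c = r % h, c % w
        elif not (0 <= r < h and 0 <= c < w):
            continue  # off the edge of a non-wrapping mesh
        nbrs[name] = r * w + c
    return nbrs


# 2 columns x 3 rows, world_size = 6
print(rect_neighbors(0, w=2, h=3, wrap=False))
print(rect_neighbors(5, w=2, h=3, wrap=True))
```

On a grid only two columns wide, the wrapping W and E neighbours coincide, which is why the torus expectations above list the same rank for both directions.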