Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)
Major changes:
PE-level IPCQ infrastructure:
- New PE_IPCQ component: ring-buffer control plane with 4-direction
neighbor mapping, head/tail pointers, backpressure (poll/sleep).
- PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA,
including in-flight data snapshot (D9) and op_log recording at
outbound time for Phase 2 replay correctness.
- IpcqDmaToken piggyback model: data + metadata travel together,
atomic visibility at receiver (invariant I6).
- Credit return fast path: bottleneck-BW latency, no fabric vc_comm.
Phase 2 data execution (ADR-0020 integration):
- op_log extended: DmaWriteCmd now captures src_space/src_addr for
Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time.
- DataExecutor replays dma_write + ipcq_copy in t_start order.
- Engine._flush_data_phase: incremental cursor-based replay after
each engine.wait() so host reads see post-Phase-2 data.
- KernelRunner Phase 1 writes disabled when op_log is active to
prevent stale data from corrupting the MemoryStore snapshot.
TLContext / kernel API:
- tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype),
tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode.
- TensorHandle operator overloading (add/sub/mul/div) via thread-local
active TLContext → MathCmd dispatch through PE_MATH.
- PE-local scratch allocator for math output handles.
- tl.load returns space="hbm" handles for correct Phase 2 addressing.
- Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv.
Unified ccl_allreduce bench (PyTorch-compat host code):
- Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch)
split matching real PyTorch DDP worker pattern.
- torch.distributed facade: init_process_group, get_world_size,
get_rank, get_backend, all_reduce, barrier — only real PyTorch names.
- AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches
kernel via tensor shard metadata (n_elem from shards[0].nbytes).
- world_size derived from topology spec (sips × cubes × pes_per_cube)
with optional algorithm-level override in ccl.yaml.
Tensor API (PyTorch-compat surface):
- Tensor.numpy(): gather-aware (all shards via VA-based addressing).
- Tensor.copy_(source): scatter from host tensor into sharded target.
- RuntimeContext.from_numpy(arr): host-side staging tensor.
- Tensor.data property fixed to use numpy() (was shards[0]-only).
Algorithm modules moved to src/kernbench/ccl/algorithms/:
- ring_allreduce, mesh_allreduce, tree_allreduce, hello_send.
- Each module exports kernel_args(world_size, n_elem) helper.
- ccl.yaml module paths updated to kernbench.ccl.algorithms.*.
Dead code removed:
- 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.).
- _run_ccl_bench greenlet-per-SIP scheduler.
- benches.loader.is_ccl_bench + run_rank detection.
- benches/ccl/ directory.
Tests:
- New test_ccl_allreduce_matrix.py: 7 parametrized cases
(ring×3 buffers, ring 8/16, mesh 4, tree 7).
- New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests.
- Existing tests updated for new import paths + world_size_override.
Docs:
- Korean ccl-author-guide.md and ADR-0023 paths updated.
- New English versions: ccl-author-guide.en.md, ADR-0023.en.md.
502 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,866 @@
|
||||
# ADR-0023: PE-level IPCQ — Inter-PE Collective Communication
|
||||
|
||||
## Status
|
||||
|
||||
Proposed
|
||||
|
||||
## Context
|
||||
|
||||
### Goal
|
||||
|
||||
Add the infrastructure that lets CCL (Collective Communication Library)
|
||||
kernels run **inside** a PE. The host just launches a kernel on each
|
||||
SIP; the actual synchronization and data movement happen **inside the
|
||||
PE kernel via an IPCQ (Inter-Process Communication Queue)**.
|
||||
|
||||
This mirrors how NCCL performs NVLink communication inside a GPU
|
||||
kernel, or how Cerebras / Tenstorrent expose core-local communication
|
||||
queues. Host-level collectives (`dist.all_reduce`) are deferred to
|
||||
**future work**; this ADR focuses solely on the kernel-side collective
|
||||
infrastructure.
|
||||
|
||||
### Current state
|
||||
|
||||
- ADR-0021 PE pipeline refactor: each PE is decomposed into components
|
||||
(PE_CPU, PE_SCHEDULER, PE_DMA, PE_FETCH_STORE, PE_GEMM, PE_MATH,
|
||||
PE_TCM, PE_MMU).
|
||||
- No direct PE-to-PE channel exists today. All data movement goes
|
||||
through PE_DMA → cube_noc / UCIe / PCIE → HBM.
|
||||
- A pre-ADR host CCL skeleton exists (`dist.init_process_group(backend="ahbm")`,
|
||||
`_run_ccl_bench` running per-rank greenlets concurrently). The
|
||||
collective itself is a stub.
|
||||
|
||||
### Problems to solve
|
||||
|
||||
1. PE-to-PE direct data movement (writing into a peer's memory).
|
||||
2. Synchronization — the sender must check that the receiver has space
|
||||
in its buffer (backpressure).
|
||||
3. Resource contention between compute traffic and communication
|
||||
traffic (Head-of-Line blocking).
|
||||
4. The host must be able to construct logical neighbor topologies
|
||||
(ring / mesh / tree) per algorithm.
|
||||
|
||||
---
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. Add a new `PE_IPCQ` component
|
||||
|
||||
A new component `PE_IPCQ` is added inside each PE. It follows the same
|
||||
pattern as PE_GEMM / PE_MATH — modeling a sub-block of the PE as a
|
||||
distinct component.
|
||||
|
||||
```
|
||||
PE
|
||||
├── PE_CPU
|
||||
├── PE_SCHEDULER
|
||||
├── PE_DMA
|
||||
├── PE_IPCQ ← new
|
||||
├── PE_FETCH_STORE
|
||||
├── PE_GEMM
|
||||
├── PE_MATH
|
||||
├── PE_TCM
|
||||
├── PE_MMU
|
||||
```
|
||||
|
||||
**Role separation** (control plane vs. data plane):
|
||||
|
||||
- **PE_IPCQ (control plane)**: ring-buffer address arithmetic, head /
|
||||
tail pointer management, peer pointer caches, backpressure, 4-direction
|
||||
neighbor mapping.
|
||||
- **PE_DMA (data plane)**: actually moves data through cube_noc / UCIe
|
||||
/ PCIE into the peer's memory.
|
||||
|
||||
PE_IPCQ does **not** move data itself — it delegates to PE_DMA.
|
||||
|
||||
### D2. Ring buffer model
|
||||
|
||||
Each PE owns 4 directions (N/S/E/W) × {tx, rx} = 8 ring buffers.
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class IpcqQueuePair:
|
||||
direction: Direction # N/S/E/W
|
||||
peer: IpcqEndpoint # set by host at init time (D2.5)
|
||||
tx_buffer_base: int # outgoing data base addr (in our memory)
|
||||
rx_buffer_base: int # incoming data base addr (in our memory)
|
||||
slot_size: int # 1 tile per slot
|
||||
n_slots: int # ring depth
|
||||
my_head: int # next slot we will write/send into
|
||||
my_tail: int # next slot we will read/recv from
|
||||
peer_head_cache: int # peer's last-seen head (updated via D9 piggyback)
|
||||
peer_tail_cache: int # peer's last-seen tail (updated via D9 fast-path credit)
|
||||
```
|
||||
|
||||
**Canonical field names**: throughout this ADR the four names above
|
||||
(`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`) are used
|
||||
consistently. Synonyms (`peer_head_local`, `peer_head`, `peer_tail`,
|
||||
etc.) are not used.
|
||||
|
||||
| Field | Owner | Updated when |
|
||||
|-------|-------|--------------|
|
||||
| `my_head` | local PE_IPCQ | immediately after `tl.send` (send tracking) |
|
||||
| `my_tail` | local PE_IPCQ | immediately after `tl.recv` (recv tracking) |
|
||||
| `peer_head_cache` | local PE_IPCQ | on `IpcqMetaArrival` (D9 piggyback) |
|
||||
| `peer_tail_cache` | local PE_IPCQ | on `IpcqCreditMetadata` (D9 fast path) |
|
||||
|
||||
**Slot unit**: fixed-size, one slot holds one full tile (no descriptor
|
||||
indirection). Full data embedded in the slot. See D5.
|
||||
|
||||
### D2.5. `IpcqEndpoint` schema
|
||||
|
||||
`IpcqQueuePair.peer` carries everything the sender needs to compute the
|
||||
peer's rx slot address:
|
||||
|
||||
```python
|
||||
@dataclass(frozen=True)
|
||||
class IpcqEndpoint:
|
||||
sip: int
|
||||
cube: int
|
||||
pe: int
|
||||
buffer_kind: str # "tcm" | "hbm" | "sram"
|
||||
rx_base_pa: int # peer rx_buffer base PA (PhysAddr.encode())
|
||||
rx_base_va: int # peer rx_buffer base VA (optional, MMU mode)
|
||||
n_slots: int # peer ring depth (for wrap-around)
|
||||
slot_size: int # peer slot size (for offset)
|
||||
```
|
||||
|
||||
Address computation:
|
||||
|
||||
```python
|
||||
slot_idx = self.my_head % peer.n_slots
|
||||
dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
|
||||
```
|
||||
|
||||
PE_IPCQ passes `dst_pa` to PE_DMA inside an `IpcqDmaToken`. PE_DMA
|
||||
(vc_comm) routes the data to `dst_pa` through the fabric.
|
||||
|
||||
**Endpoint construction order**: at backend init (D10), the IPCQ
|
||||
buffers for **every PE** are allocated first (so each rank knows the
|
||||
others' PA), then the per-rank neighbor tables are built and pushed to
|
||||
PE_IPCQ via `IpcqInitMsg`.
|
||||
|
||||
### D3. Four-direction mapping ≡ logical ProcessGroup
|
||||
|
||||
The PE views four directions (N/S/E/W) as logical ports. Real peer
|
||||
addresses are configured by the host CCL init, per the chosen
|
||||
algorithm. The PE kernel never knows the topology, only directions.
|
||||
|
||||
```python
|
||||
# 1D ring
|
||||
for rank in range(world_size):
|
||||
ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size])
|
||||
ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size])
|
||||
|
||||
# 2D mesh
|
||||
for r in range(R):
|
||||
for c in range(C):
|
||||
ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c))
|
||||
ipcq_set_neighbor((r, c), "S", peer=((r + 1) % R, c))
|
||||
ipcq_set_neighbor((r, c), "E", peer=(r, (c + 1) % C))
|
||||
ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C))
|
||||
```
|
||||
|
||||
The PE code does not need to know where `tl.send(dir="E", ...)` actually
|
||||
ends up.
|
||||
|
||||
### D4. PE kernel API
|
||||
|
||||
```python
|
||||
# Send (blocking; may stall on backpressure)
|
||||
tl.send(dir: str, src=TensorHandle)
|
||||
tl.send(dir: str, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)
|
||||
|
||||
# Recv (blocking)
|
||||
recv = tl.recv(dir: str, shape=..., dtype=...)
|
||||
recv = tl.recv(shape=..., dtype=...) # round-robin across 4 directions
|
||||
|
||||
# Recv (non-blocking)
|
||||
fut = tl.recv_async(dir: str, shape=..., dtype=...)
|
||||
recv = tl.wait(fut)
|
||||
```
|
||||
|
||||
`tl.recv()` (no direction) keeps a `last_polled_dir` cursor and on each
|
||||
call rotates through directions, returning the first available slot.
|
||||
Empty in all 4 directions → wait.
|
||||
|
||||
**Fairness is weak**: the rotating start mitigates simple bias, but if
|
||||
one direction always wins the race the others can starve. Algorithms
|
||||
that need strict fairness must call `tl.recv(dir=...)` explicitly.
|
||||
|
||||
### D5. Single-hop DMA write + full-data slot model
|
||||
|
||||
Data moves from sender memory into the receiver's ring slot in **one
|
||||
DMA transfer**. Key properties:
|
||||
|
||||
- **Single-hop**: the sender already knows the peer rx slot address and
|
||||
fires one fabric DMA into it.
|
||||
- **No CPU memcpy**: the CPU never copies data.
|
||||
- **No intermediate staging**: neither side keeps a separate staging
|
||||
buffer (sender uses the source addr directly; receiver gets the data
|
||||
in its ring slot directly).
|
||||
|
||||
(Strictly speaking the fabric DMA write does happen, so this is not
|
||||
literally "no data movement" — it's the same property NCCL labels
|
||||
"zero-copy", meaning no CPU memcpy and no staging copy.)
|
||||
|
||||
```
|
||||
PE A: tl.send(E, src_addr, nbytes)
|
||||
1. IPCQ computes the peer rx slot address:
|
||||
dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size
|
||||
2. Backpressure: my_head - peer_tail_cache < peer.n_slots ?
|
||||
(full → sleep / poll)
|
||||
3. Submit DMA on PE_DMA(vc_comm): src_addr → peer dst_addr, nbytes
|
||||
4. my_head += 1
|
||||
|
||||
PE B: data = tl.recv(W)
|
||||
1. Look at rx_buffer[my_tail % n_slots]
|
||||
2. Wait for the data to arrive (D7 backpressure mode)
|
||||
3. Return the slot address to the kernel (or fetch into register file)
|
||||
4. my_tail += 1
|
||||
5. Issue a credit-return fast path (D9): after the bottleneck-BW
|
||||
latency the peer A's peer_tail_cache is updated.
|
||||
```
|
||||
|
||||
The slot holds the full tile. The receiver only reads its own
|
||||
rx_buffer; it never reads back into A's memory. The sender knows the
|
||||
peer rx slot address and DMAs directly into it (single-hop).
|
||||
|
||||
The PE's own PE_TCM read/write does not go through DMA (PE_TCM is local
|
||||
to the PE).
|
||||
|
||||
### D6. Buffer placement — three-way benchmark
|
||||
|
||||
The host CCL init picks the IPCQ ring-buffer location:
|
||||
|
||||
```python
|
||||
ipcq_init(
|
||||
backend="ahbm",
|
||||
buffer_kind="tcm" | "hbm" | "sram",
|
||||
n_slots=8,
|
||||
slot_size=4096,
|
||||
)
|
||||
```
|
||||
|
||||
| Location | Trait | Trade-off |
|
||||
|----------|-------|-----------|
|
||||
| **PE_TCM** | Attached to the PE; fast | Small; competes with PE-internal resources |
|
||||
| **PE-local HBM** | Large; via DMA | Higher latency |
|
||||
| **Cube SRAM** | Mid-size; cube-shared | Cube-internal contention |
|
||||
|
||||
All three locations run the same kernel code; only the init differs.
|
||||
|
||||
### D7. Backpressure — two-mode benchmark
|
||||
|
||||
How the sender or receiver waits when peer slots are full / data not
|
||||
yet arrived:
|
||||
|
||||
| Mode | Behavior | Model |
|
||||
|------|----------|-------|
|
||||
| **poll** | Periodically re-check the cached peer pointer | Spin loop |
|
||||
| **sleep** | Yield a SimPy event; wake on a peer-trigger | Interrupt-like |
|
||||
|
||||
```python
|
||||
ipcq_init(backpressure="poll" | "sleep", ...)
|
||||
```
|
||||
|
||||
Both modes are implemented so latency / throughput trade-offs can be
|
||||
benchmarked.
|
||||
|
||||
### D8. PE_DMA virtual channels
|
||||
|
||||
Extend PE_DMA from a single queue into a **two-channel virtual-channel**
|
||||
model.
|
||||
|
||||
```
|
||||
PE_DMA
|
||||
├── vc_compute: tile load / store / writeback for GEMM and Math
|
||||
└── vc_comm: IPCQ send data
|
||||
```
|
||||
|
||||
Each VC has an independent state machine:
|
||||
|
||||
- One channel stalling does not block the other.
|
||||
- The same physical link (cube_noc, UCIe, …) is shared, but link BW is
|
||||
split between channels.
|
||||
|
||||
**Chunk-level interleave**:
|
||||
|
||||
- Large GEMM tile DMAs do not lock the link end-to-end.
|
||||
- Progress happens in chunks (e.g. 256 B); each chunk shares link BW
|
||||
with the other VC's pending chunks.
|
||||
- Chunk size is an init parameter (smaller = fairer, larger = more
|
||||
efficient).
|
||||
|
||||
Net effect:
|
||||
|
||||
- HoL blocking is eliminated (an IPCQ send can interleave with a long
|
||||
compute DMA).
|
||||
- Compute / comm overlap is natural (NVIDIA copy-engine + compute-SM
|
||||
pattern).
|
||||
- Matches the NoC-virtual-channel pattern used in real HW.
|
||||
|
||||
**First-implementation accuracy limit (intentional)**: this ADR's
|
||||
first cut uses **deterministic chunk-level interleave + weighted
|
||||
round-robin arbitration** (default 50 / 50, exposed in `ccl.yaml`).
|
||||
This is a first-order approximation and is simpler than real HW
|
||||
dynamic-contention / credit-based arbiters. Functional correctness is
|
||||
unaffected, but heavy-contention scenarios may report slightly
|
||||
optimistic latency vs. real HW. A separate ADR can add a NoC arbiter
|
||||
component later if more precision is needed.
|
||||
|
||||
#### Token routing
|
||||
|
||||
- Compute tokens (`TileToken`) — go through the existing
|
||||
PE_FETCH_STORE → PE_DMA chain.
|
||||
- Communication tokens (`IpcqDmaToken`, new) — PE_IPCQ → PE_DMA
|
||||
self-routing.
|
||||
- PE_DMA picks the channel by token type.
|
||||
|
||||
```python
|
||||
class PeDmaComponent:
|
||||
def _process(self, env, token):
|
||||
if isinstance(token, IpcqDmaToken):
|
||||
yield from self._vc_comm_process(env, token)
|
||||
else:
|
||||
yield from self._vc_compute_process(env, token)
|
||||
```
|
||||
|
||||
### D9. Pointer synchronization — DMA payload piggyback
|
||||
|
||||
Real HW (NVLink, UCIe, etc.) piggybacks metadata onto DMA payloads so
|
||||
pointers update along with the data. This simulation adopts the same
|
||||
model: **no separate control channel** — metadata travels with the
|
||||
data.
|
||||
|
||||
The big benefits:
|
||||
|
||||
- **Automatic ordering**: data and metadata move on the same token, so
|
||||
data is visible **before** the head_cache update. No race.
|
||||
- **HW fidelity**: matches NVLink / UCIe piggybacked headers.
|
||||
- **Component simplification**: no separate `IpcqPtrUpdate` event type.
|
||||
|
||||
#### Send flow (head update via piggyback)
|
||||
|
||||
```
|
||||
PE A: tl.send(E, src_addr, nbytes)
|
||||
1. PE_IPCQ checks backpressure (using peer_tail_cache)
|
||||
2. PE_IPCQ creates an IpcqDmaToken:
|
||||
- data body (src_addr → peer dst_addr)
|
||||
- piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction)
|
||||
3. Hand the token to PE_DMA(vc_comm)
|
||||
4. PE A increments my_head (send tracking)
|
||||
|
||||
[fabric DMA: latency elapses]
|
||||
|
||||
PE B's PE_DMA receives the token
|
||||
5. Writes data into dst_addr (B's rx slot) via MemoryStore.write
|
||||
6. Forwards token metadata to PE B's PE_IPCQ (PE-internal wire, ~1 cycle)
|
||||
|
||||
PE B's PE_IPCQ receives the metadata
|
||||
7. Updates peer_head_cache (= A's head)
|
||||
8. Wakes any pending recv on that direction
|
||||
```
|
||||
|
||||
**Steps 5 and 6 must execute in the same SimPy step** — DMA completion
|
||||
makes data and metadata atomically visible.
|
||||
|
||||
#### Recv flow (credit return — fast path with bottleneck-BW latency)
|
||||
|
||||
When the receiver frees a slot, the sender must learn about it
|
||||
(backpressure release). Unlike data, the credit return does **not**
|
||||
travel through general vc_comm fabric — it uses a **separate fast
|
||||
path**, an abstraction of the NVLink / UCIe credit-return wire.
|
||||
|
||||
**Latency** is computed from the **bottleneck BW on the path**, not a
|
||||
magic constant:
|
||||
|
||||
```
|
||||
credit_size_bytes = 16 (ccl.yaml: ipcq_credit_size_bytes)
|
||||
path = router.find_path(self_pe, peer_pe)
|
||||
latency = compute_drain_ns(path, credit_size_bytes)
|
||||
= credit_size_bytes / bottleneck_bw_on_path
|
||||
```
|
||||
|
||||
That gives us:
|
||||
|
||||
- **Topology-proportional approximation**: an in-cube credit return is
|
||||
automatically faster than a cross-SIP credit return.
|
||||
- **No magic constants**: no arbitrary `ipcq_ctrl_latency_ns`.
|
||||
- **No deadlock risk**: unlike piggyback, B can issue credit even when
|
||||
it has no data to send back.
|
||||
- **Reuses existing utility**: `ComponentContext.compute_drain_ns`.
|
||||
|
||||
#### Component coupling — SimPy Store channel
|
||||
|
||||
PE B's PE_IPCQ does not call PE A's PE_IPCQ directly. Instead, at init
|
||||
time, **a SimPy Store is wired between the two** (a per-direction
|
||||
fast-path channel) and credit metadata is `put` into that store.
|
||||
|
||||
```python
|
||||
class PeIpcqComponent:
|
||||
def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns):
|
||||
yield env.timeout(latency_ns)
|
||||
yield peer_credit_store.put(IpcqCreditMetadata(seq=my_tail, ...))
|
||||
```
|
||||
|
||||
Backend init wires both directions of the fast-path channel as part of
|
||||
fan-out (see `IpcqInitMsg` in D12).
|
||||
|
||||
#### Credit-return fast path limitations
|
||||
|
||||
- `credit_size_bytes` is an estimate (typically 16–64 bytes).
|
||||
- The fast path is **excluded from vc_comm BW contention** (separate
|
||||
wire). Real HW credit-return wires are very lightweight, so this is a
|
||||
reasonable first approximation.
|
||||
- A follow-up ADR can: model the credit fast path as a separate link
|
||||
(BW limit + contention), or switch to piggyback (`credit_return_mode:
|
||||
piggyback`).
|
||||
|
||||
#### PE_DMA's added responsibility
|
||||
|
||||
When `vc_comm` receives a token, PE_DMA processes it as the following
|
||||
**atomic** sequence. **No SimPy yield is allowed between the two steps**
|
||||
(invariant I6):
|
||||
|
||||
```python
|
||||
def _on_vc_comm_recv(self, env, token):
|
||||
# ── ATOMIC: no yield between these two operations ──
|
||||
data = self._memory_store.read(token.src_space, token.src_addr,
|
||||
shape=..., dtype=...)
|
||||
self._memory_store.write(token.dst_endpoint.buffer_kind,
|
||||
token.dst_addr, data)
|
||||
# 2. Forward metadata to the local PE_IPCQ
|
||||
yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token))
|
||||
# ───────────────────────────────────────────────────
|
||||
```
|
||||
|
||||
The final `put` is yieldable but uses an unbounded internal store, so
|
||||
it completes in a single step. That `put` is the closing call of the
|
||||
atomic block; nothing may be inserted before it.
|
||||
|
||||
### D9.5. ADR-0020 (2-pass) integration
|
||||
|
||||
`tl.send` / `tl.recv` integrates with ADR-0020's two-pass model. Phase
|
||||
1 simulates timing **and** moves data via MemoryStore; Phase 2 enables
|
||||
op-log-based correctness verification.
|
||||
|
||||
#### Phase 1 (timing + data)
|
||||
|
||||
D9 models head and tail updates with two different mechanisms:
|
||||
|
||||
- **Send-side (head update)** — DMA payload piggyback. Data write and
|
||||
metadata forward happen in the same SimPy step → automatic atomic
|
||||
visibility.
|
||||
- **Recv-side (tail credit return)** — fast-path SimPy Store channel
|
||||
with bottleneck-BW latency, then `peer_tail_cache` update.
|
||||
|
||||
Together they preserve ring-buffer pointer consistency.
|
||||
|
||||
The op-log records `op_kind="ipcq"` entries for sends (with
|
||||
`src/dst/space/addr/nbytes/dir/dtype/shape/sender_seq`) and recvs (with
|
||||
`recv_mode/src/dst/space/addr/nbytes/dir/dtype/shape/consumer_seq`).
|
||||
Two recv modes:
|
||||
|
||||
- **`return_slot`** (default): the slot address is returned to the
|
||||
kernel. Zero-copy.
|
||||
- **`copy_to_dst`**: when the kernel passes `dst_addr` + `dst_space`,
|
||||
PE_IPCQ copies the slot data into the user dst.
|
||||
|
||||
#### Phase 2 (op_log replay)
|
||||
|
||||
When `DataExecutor` encounters an `op_kind="ipcq"` record:
|
||||
|
||||
- **send**: idempotent `src → dst` ndarray write.
|
||||
- **recv (`return_slot`)**: no-op (the slot already holds the data).
|
||||
- **recv (`copy_to_dst`)**: idempotent `slot → dst_addr` copy.
|
||||
|
||||
IPCQ ops are pure data movement — Phase 2 has nothing extra to compute.
|
||||
The downstream GEMM / Math ops in `DataExecutor` will consume the data
|
||||
and naturally validate correctness.
|
||||
|
||||
### D10. Host CCL init keeps the PyTorch shape
|
||||
|
||||
The host code looks just like real PyTorch DDP. `init_process_group`
|
||||
creates the backend object; it does **not** receive IPCQ knobs
|
||||
(neighbor topology, buffer_kind, backpressure …).
|
||||
|
||||
```python
|
||||
# benches/ccl_allreduce.py — same shape as real PyTorch
|
||||
def worker(rank, world_size, torch):
|
||||
dist = torch.distributed
|
||||
dist.init_process_group(backend="ahbm") # reads ccl.yaml + topology
|
||||
tensor = torch.zeros((1, world_size * N_ELEM), dtype="f16", dp=...)
|
||||
tensor.copy_(torch.from_numpy(init))
|
||||
dist.all_reduce(tensor, op="sum")
|
||||
```
|
||||
|
||||
The IPCQ configuration is decided by the backend at
|
||||
`init_process_group` time: it loads `ccl.yaml`, picks the algorithm,
|
||||
and pushes IPCQ neighbor tables to every participating PE_IPCQ. The
|
||||
host code never has to know about IPCQ.
|
||||
|
||||
A bench runs one algorithm, chosen via `ccl.yaml`'s `defaults.algorithm`.
|
||||
Switching algorithms is purely a `ccl.yaml` change — no host edits
|
||||
required.
|
||||
|
||||
#### Init flow (eager)
|
||||
|
||||
1. `init_process_group(backend="ahbm")` is called.
|
||||
2. Backend loads `ccl.yaml` → resolves `defaults.algorithm`.
|
||||
3. Pulls topology + buffer_kind + backpressure + slot config from
|
||||
`algorithms[<algo>]`.
|
||||
4. **Immediately** installs neighbor tables on every PE_IPCQ
|
||||
(sideband or fabric `IpcqInitMsg`).
|
||||
5. Subsequent `torch.launch(kernel_name, ...)` calls behave normally —
|
||||
PE_IPCQ is already prepared whether the kernel is a CCL kernel or
|
||||
not.
|
||||
|
||||
### D11. CCL config file (`ccl.yaml`)
|
||||
|
||||
IPCQ config and algorithm metadata live in a separate YAML file,
|
||||
following the same pattern as `components.yaml` and `topology.yaml`.
|
||||
|
||||
A single benchmark execution runs one algorithm
|
||||
(`defaults.algorithm`). Switching algorithms means editing
|
||||
`defaults.algorithm` only.
|
||||
|
||||
```yaml
|
||||
defaults:
|
||||
algorithm: ring_allreduce_tcm
|
||||
buffer_kind: tcm # tcm | hbm | sram
|
||||
backpressure: sleep # poll | sleep
|
||||
n_slots: 8
|
||||
slot_size: 4096
|
||||
vc_chunk_size: 256
|
||||
ipcq_credit_size_bytes: 16
|
||||
|
||||
algorithms:
|
||||
ring_allreduce_tcm:
|
||||
module: kernbench.ccl.algorithms.ring_allreduce
|
||||
topology: ring_1d # builtin name or "custom"
|
||||
buffer_kind: tcm
|
||||
n_elem: 8 # optional, per-algorithm tile width
|
||||
|
||||
tree_allreduce_7:
|
||||
module: kernbench.ccl.algorithms.tree_allreduce
|
||||
topology: tree_binary
|
||||
buffer_kind: tcm
|
||||
world_size: 7 # algorithm-level override
|
||||
n_elem: 16
|
||||
|
||||
custom_mesh:
|
||||
module: kernbench.ccl.algorithms.custom_mesh
|
||||
topology: custom # the module supplies its own neighbors()
|
||||
```
|
||||
|
||||
`world_size` is **not set in `defaults`**. The backend resolves it via:
|
||||
`algorithm-level override > defaults override > topology spec`. The
|
||||
last fallback (`sips × cubes_per_sip × pes_per_cube`) mirrors real DDP
|
||||
where `WORLD_SIZE` comes from env vars rather than config files.
|
||||
|
||||
#### Algorithm module structure
|
||||
|
||||
Each algorithm module exports two hooks — `kernel` (required) and
|
||||
`neighbors` (optional) — plus a `kernel_args` helper that the
|
||||
backend uses to populate positional kernel arguments at `all_reduce`
|
||||
time:
|
||||
|
||||
```python
|
||||
# src/kernbench/ccl/algorithms/ring_allreduce.py
|
||||
|
||||
def kernel_args(world_size: int, n_elem: int) -> tuple:
|
||||
return (n_elem, world_size)
|
||||
|
||||
|
||||
def kernel(t_ptr, n_elem, world_size, tl):
|
||||
"""Required — the PE kernel.
|
||||
|
||||
IPCQ is already installed by the backend before this is called.
|
||||
The kernel only uses the four-direction send / recv API.
|
||||
"""
|
||||
...
|
||||
|
||||
|
||||
def neighbors(rank, world_size, neighbor_map):
|
||||
"""Optional — override the builtin topology's neighbor map.
|
||||
|
||||
Returns a new dict, the modified-in-place dict, or None to keep the
|
||||
builtin map.
|
||||
"""
|
||||
return None
|
||||
```
|
||||
|
||||
#### `neighbors` override patterns
|
||||
|
||||
- **Pattern A — tweak a builtin**: drop a direction for some ranks, etc.
|
||||
- **Pattern B — replace entirely**: ignore `neighbor_map` and return a
|
||||
brand-new dict.
|
||||
- **Pattern C — keep builtin**: omit `neighbors` or return None.
|
||||
|
||||
#### Builtin topologies
|
||||
|
||||
| topology | direction set |
|
||||
|----------|---------------|
|
||||
| `ring_1d` | E, W |
|
||||
| `ring_1d_unidir` | E only |
|
||||
| `mesh_2d` | N, S, E, W |
|
||||
| `tree_binary` | parent, child_left, child_right |
|
||||
| `none` | (empty) — algorithm must supply `neighbors()` |
|
||||
|
||||
#### Adding a new algorithm
|
||||
|
||||
1. Write `kernel` and `kernel_args` in
|
||||
`src/kernbench/ccl/algorithms/<algo>.py`.
|
||||
2. Add an entry in `ccl.yaml`'s `algorithms` section.
|
||||
3. (Optional) provide `neighbors()` for custom topology.
|
||||
4. Set `defaults.algorithm` to the new algorithm.
|
||||
|
||||
The host bench (`benches/ccl_allreduce.py`) does not change.
|
||||
|
||||
### D12. Message / token schema
|
||||
|
||||
The new message types added by this ADR. They live in
|
||||
`src/kernbench/common/pe_commands.py` and
|
||||
`src/kernbench/runtime_api/kernel.py`.
|
||||
|
||||
#### `IpcqInitMsg` (sideband, fan-out at init)
|
||||
|
||||
The backend pushes neighbor tables to every PE_IPCQ. Structure mirrors
|
||||
`MmuMapMsg` (`target_sips`, `target_cubes`, `target_pe`, `entries`).
|
||||
Each `IpcqInitEntry` has `direction`, `peer: IpcqEndpoint`,
|
||||
`my_rx_base_pa/va`, `n_slots`, `slot_size`, plus a `peer_credit_store`
|
||||
field — a `simpy.Store` instance pre-wired so the sender PE_IPCQ can
|
||||
push `IpcqCreditMetadata` directly into the receiver's input queue.
|
||||
|
||||
#### `IpcqSendCmd` (PE_CPU → PE_IPCQ)
|
||||
|
||||
Carries `direction`, source addr/space, nbytes, shape, dtype, and a
|
||||
handle id. `data_op=True` so it lands in the op_log.
|
||||
|
||||
#### `IpcqRecvCmd` (PE_CPU → PE_IPCQ)
|
||||
|
||||
Carries `direction` (or None for round-robin), `recv_mode`
|
||||
(`return_slot` / `copy_to_dst`), optional `dst_addr/dst_space`, shape,
|
||||
dtype, blocking flag.
|
||||
|
||||
#### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm channel)
|
||||
|
||||
Per D9 piggyback: the token carries the data (`src/dst/space/nbytes`)
|
||||
plus the head metadata (`sender_seq`, `src_sip/cube/pe`,
|
||||
`src_direction`). PE_DMA picks the channel by token type
|
||||
(`IpcqDmaToken → vc_comm`, `TileToken → vc_compute`).
|
||||
|
||||
The receiver's PE_DMA, on token arrival, performs the I6 atomic
|
||||
sequence: write data into MemoryStore, then forward `IpcqMetaArrival`
|
||||
to the local PE_IPCQ.
|
||||
|
||||
#### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path)
|
||||
|
||||
Carries `consumer_seq` (= my_tail), source PE coords, and source
|
||||
direction. Travels through the dedicated SimPy Store channel rather
|
||||
than `vc_comm`. Latency = `credit_size_bytes / bottleneck_bw_on_path`.
|
||||
|
||||
There is **no `IpcqPtrUpdate` event** — head updates flow via D9
|
||||
piggyback, tail updates via the D9 fast-path channel.
|
||||
|
||||
### D13. Test strategy
|
||||
|
||||
Following the ADR-0021 D8 pattern.
|
||||
|
||||
#### T1. Unit tests (component-level)
|
||||
|
||||
- **PE_IPCQ** (`tests/test_pe_ipcq.py`): send without backpressure
|
||||
immediately forwards a token; full peer slot triggers backpressure
|
||||
(poll / sleep modes); recv waits, wakes on `IpcqMetaArrival`;
|
||||
round-robin recv weak fairness; bad direction → `IpcqInvalidDirection`.
|
||||
- **PE_DMA virtual channels** (`tests/test_pe_dma_vc.py`): `vc_compute`
|
||||
/ `vc_comm` independent progress, chunk interleave, BW split.
|
||||
- **Builtin topology** (`tests/test_ccl_topologies.py`): ring_1d /
|
||||
mesh_2d / tree_binary correctness, mesh_2d non-square →
|
||||
`ValueError`, custom resolver returns the module's `neighbors`.
|
||||
|
||||
#### T2. Integration tests (E2E send/recv)
|
||||
|
||||
- **`tests/test_ipcq_e2e.py`**: 2-rank ring, 4-rank ring (bidirectional
|
||||
no-deadlock), 4×4 mesh.
|
||||
- **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`): greenlet mode
|
||||
records `ipcq` ops in op_log; DataExecutor produces correct
|
||||
`out.data`.
|
||||
|
||||
#### T3. Backend init (`tests/test_ccl_backend_ipcq.py`)
|
||||
|
||||
`ccl.yaml` load, builtin topology → `IpcqInitMsg` fan-out, endpoint PA
|
||||
consistency, per-`buffer_kind` allocation.
|
||||
|
||||
#### T4. Regression
|
||||
|
||||
All existing tests pass; ADR-0020 op_log / DataExecutor unaffected for
|
||||
non-CCL benches.
|
||||
|
||||
#### T5. Performance / overhead
|
||||
|
||||
Single send/recv pair latency = (DMA latency) + (IPCQ overhead).
|
||||
Should be close to a regular PE_DMA write of the same nbytes (IPCQ
|
||||
overhead < 100 ns).
|
||||
|
||||
### D14. Invariants and failure modes
|
||||
|
||||
#### Invariants
|
||||
|
||||
I1. **Slot lifecycle exactly-once**: one send → exactly one recv.
|
||||
I2. **Pointer monotonicity**: `my_head` / `my_tail` strictly
|
||||
non-decreasing; `sender_seq` strictly increasing.
|
||||
I3. **Endpoint consistency**: if rank A's `direction=E` peer is rank
|
||||
B, then rank B's reverse-direction peer must be rank A. Verified at
|
||||
init.
|
||||
I4. **`buffer_kind` consistency**: all PEs in a process group share
|
||||
the same `buffer_kind` (no mixed mode in the first cut).
|
||||
I5. **op_log ordering**: send → DMA complete → recv possible. The
|
||||
t_start order in op_log respects this causality.
|
||||
I6. **Atomic data + metadata visibility (MUST)**: at the receiver
|
||||
side, data write (`MemoryStore.write`) and metadata forward
|
||||
(`peer_head_cache` update) **must execute in the same SimPy step**.
|
||||
No yield is allowed between the two operations in PE_DMA's vc_comm
|
||||
handler. Code review must reject any inserted `yield` (or `yield
|
||||
from`) — it would create a race where head_cache becomes visible
|
||||
before or after the data.
|
||||
I7. **MemoryStore slot existence ↔ pointer**: as a consequence of I6,
|
||||
the step in which `peer_head_cache > my_tail` becomes truthy is the
|
||||
same step in which the slot data is observable.
|
||||
|
||||
#### Failure modes (runtime errors)
|
||||
|
||||
F1. **Bad direction**: `tl.send(dir="X")` for an uninstalled direction
|
||||
→ `IpcqInvalidDirection`, simulation aborts.
|
||||
F2. **Type mismatch**: dtype/shape/nbytes disagreement between matched
|
||||
send and recv. Not validated by default; opt-in strict mode catches
|
||||
it (`strict_validation: true` on a PE_IPCQ node attrs).
|
||||
F3. **Deadlock detection (timeout-based)**: the simulator empties its
|
||||
schedule while a send/recv is still pending → engine raises
|
||||
`IpcqDeadlock` and embeds a pointer dump.
|
||||
F4. **Backend init failure**: missing `defaults.algorithm`, missing
|
||||
`algorithms[name]`, module import failure, topology validation
|
||||
failure (I3, I4) — all raised at `init_process_group` time.
|
||||
F5. **Slot full + infinite backpressure**: the peer never recvs.
|
||||
Surfaces as F3 timeout.
|
||||
|
||||
#### Diagnostics
|
||||
|
||||
- **CCL trace**: `KERNBENCH_CCL_TRACE=1` logs each send/recv as
|
||||
`(rank, t, dir, nbytes)`.
|
||||
- **Pointer dump**: `kernbench.ccl.diagnostics.pointer_dump(engine)`
|
||||
prints every PE_IPCQ ring buffer's `my_head`, `my_tail`,
|
||||
`peer_head_cache`, `peer_tail_cache`.
|
||||
- **Deadlock dump**: on hang the engine includes the pointer dump in
|
||||
the `IpcqDeadlock` exception message.
|
||||
|
||||
### D15. Algorithm-author cheat sheet
|
||||
|
||||
Full step-by-step lives in
|
||||
[`docs/ccl-author-guide.en.md`](../ccl-author-guide.en.md). The
|
||||
shortest version:
|
||||
|
||||
| Things you touch | Things you don't |
|
||||
|------------------|-------------------|
|
||||
| `src/kernbench/ccl/algorithms/<your_algo>.py` (`kernel`, `kernel_args`, optional `neighbors`) | `benches/ccl_allreduce.py` host code |
|
||||
| One entry in `ccl.yaml` + optionally `defaults.algorithm` | `src/kernbench/ccl/` framework |
|
||||
| (Optional) `tests/test_<your_algo>.py` mock test | PE_IPCQ component, AhbmCCLBackend |
|
||||
|
||||
5-step flow: write the kernel → register in `ccl.yaml` → optional
|
||||
`neighbors` override → optional mock unit test → SimPy validation via
|
||||
`kernbench run --bench ccl_allreduce --verify-data`.
|
||||
|
||||
Common mistakes: using a direction that wasn't installed, sends
|
||||
without matching recvs (deadlock), dtype/shape disagreement, assuming
|
||||
fairness from `tl.recv()` round-robin, confusing
|
||||
`tl.num_programs(axis)` with the CCL group size.
|
||||
|
||||
---
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **Host collective**: a model where `dist.all_reduce` itself moves
|
||||
data on the host side is out of scope. This ADR only covers
|
||||
communication that happens inside the PE kernel.
|
||||
- **All-reduce algorithms**: ring / tree / etc. live in algorithm
|
||||
modules and can be added without amending this ADR.
|
||||
- **Reliability / error handling**: link faults, send/recv failure
|
||||
recovery, etc. are out of scope.
|
||||
- **NoC arbiter precision**: dynamic VC contention is left for a future
|
||||
ADR (see D8).
|
||||
|
||||
---
|
||||
|
||||
## Open questions
|
||||
|
||||
- **VC arbitration accuracy** — the first cut uses deterministic
|
||||
chunk interleave + weighted round-robin; heavy contention may report
|
||||
optimistic latency. A NoC arbiter component can be added later.
|
||||
- **Credit return BW model** — the fast path is currently outside the
|
||||
fabric BW contention model. Can be modeled as a separate link or
|
||||
switched to piggyback (`credit_return_mode: piggyback`).
|
||||
- **Ring buffer slot allocation metadata** — whether the host pushes
|
||||
IPCQ buffer metadata via sideband or via a fabric message similar to
|
||||
`MmuMapMsg` is open.
|
||||
- **VC BW split default** — 50/50 vs. weighted (e.g. 80/20). Exposed in
|
||||
`ccl.yaml`; default value TBD.
|
||||
- **Direction count** — 4 (N/S/E/W) is fixed in the first cut; 6
|
||||
(with Up/Down for 3D) or N (variable) is future work.
|
||||
- **Multi-tile aggregation primitives** — whether
|
||||
`tl.recv_all` or similar is needed for fan-in.
|
||||
- **Round-robin recv fairness** — current weak fairness can starve;
|
||||
strict fairness counter is future work.
|
||||
- **Deadlock detection precision** — currently timeout-based; a
|
||||
realtime wait-for graph would enable deterministic detection.
|
||||
|
||||
---
|
||||
|
||||
## Consequences
|
||||
|
||||
### Positive
|
||||
|
||||
- PE-to-PE direct communication enables CCL kernels to be written.
|
||||
- Host stays minimal (just `launch`), synchronization happens inside
|
||||
the PE → strong compute / comm overlap.
|
||||
- VCs eliminate HoL blocking → collective latency is not blocked by
|
||||
compute traffic.
|
||||
- Buffer placement and backpressure mode are init-time parameters →
|
||||
easy to benchmark.
|
||||
- Four-direction logical neighbors → host is free to map
|
||||
ring/mesh/tree algorithms.
|
||||
|
||||
### Negative
|
||||
|
||||
- One new component (PE_IPCQ) and a redesigned PE_DMA (VCs).
|
||||
- IPCQ memory cost = 8 rings × `slot_size` × `n_slots` per PE.
|
||||
- VC arbitration is a first-order approximation; heavy contention
|
||||
scenarios may report slightly optimistic latency vs real HW (D8).
|
||||
- Chunk-level interleave makes PE_DMA implementation more complex.
|
||||
|
||||
---
|
||||
|
||||
## Affected files
|
||||
|
||||
| File | Change |
|
||||
|------|--------|
|
||||
| `topology.yaml` | Add `pe_ipcq` to `pe_template`, plus the IPCQ ↔ DMA / CPU / TCM edges. |
|
||||
| `components.yaml` | Register `pe_ipcq_v1`. |
|
||||
| `src/kernbench/topology/builder.py` | Wire the IPCQ chain into PE-internal edges. |
|
||||
| `src/kernbench/components/builtin/pe_ipcq.py` | New. |
|
||||
| `src/kernbench/components/builtin/pe_dma.py` | Add VCs, handle `IpcqDmaToken`. |
|
||||
| `src/kernbench/common/pe_commands.py` | `IpcqSendCmd`, `IpcqRecvCmd`, `IpcqDmaToken`. |
|
||||
| `src/kernbench/triton_emu/tl_context.py` | `tl.send` / `tl.recv` API. |
|
||||
| `src/kernbench/runtime_api/distributed.py` | Eager IPCQ install in `AhbmCCLBackend.__init__`. |
|
||||
| `src/kernbench/runtime_api/kernel.py` | `IpcqInitMsg` definition. |
|
||||
| `src/kernbench/ccl/__init__.py` | New CCL package. |
|
||||
| `src/kernbench/ccl/topologies.py` | Builtin topology generators + `resolve_topology()`. |
|
||||
| `src/kernbench/ccl/helpers.py` | Algorithm-author helpers (`chunked`, `ring_step`, `tree_step`). |
|
||||
| `src/kernbench/ccl/testing.py` | Mock CCL runtime (`run_kernel_in_mock`). |
|
||||
| `src/kernbench/ccl/algorithms/*.py` | Algorithm modules (kernel + `kernel_args` + optional `neighbors`). |
|
||||
| `ccl.yaml` | Algorithm metadata + IPCQ defaults. |
|
||||
| `tests/test_pe_ipcq.py` | PE_IPCQ unit tests. |
|
||||
| `tests/test_pe_dma_vc.py` | PE_DMA VC tests. |
|
||||
| `tests/test_ipcq_e2e.py` | end-to-end send/recv tests. |
|
||||
| `tests/test_ccl_topologies.py` | Builtin topology generator tests. |
|
||||
| `tests/test_ccl_allreduce_matrix.py` | Unified bench × algorithm matrix. |
|
||||
File diff suppressed because it is too large
Load Diff
@@ -0,0 +1,592 @@
|
||||
# CCL Algorithm Author Guide (English)
|
||||
|
||||
This document is a step-by-step guide for engineers writing CCL
|
||||
(Collective Communication Library) algorithms in kernbench. The
|
||||
internal system design and component structure live in
|
||||
[ADR-0023](adr/ADR-0023-ipcq-pe-collective.md).
|
||||
|
||||
The goal here is to clearly separate **what an algorithm author has to
|
||||
touch** from **what they can leave alone**, and to get a first
|
||||
algorithm running through the shortest possible path.
|
||||
|
||||
---
|
||||
|
||||
## 0. Five-minute tour
|
||||
|
||||
| Things you touch | Location |
|
||||
|------------------|----------|
|
||||
| Algorithm module (kernel + optional `neighbors()`) | `src/kernbench/ccl/algorithms/<algo>.py` |
|
||||
| Algorithm registration | `ccl.yaml` |
|
||||
| Host bench (rank count, init, launch, verify) | `benches/<your_bench>.py` |
|
||||
| (Optional) unit test | `tests/test_<algo>.py` |
|
||||
|
||||
| Things you do NOT touch | Location |
|
||||
|--------------------------|----------|
|
||||
| TLContext API | `src/kernbench/triton_emu/tl_context.py` (ADR-0022 spec) |
|
||||
| Framework (topology generators, helpers, mock testing) | `src/kernbench/ccl/` |
|
||||
| PE_IPCQ / PE_DMA components | `src/kernbench/components/builtin/` |
|
||||
| Backend implementation (`install_ipcq`) | `src/kernbench/runtime_api/distributed.py` and `kernbench/ccl/install.py` |
|
||||
|
||||
Workflow:
|
||||
1. Write a `kernel` function in the algorithm module.
|
||||
2. Register an entry in `ccl.yaml`.
|
||||
3. Write a host bench using `torch.distributed.init_process_group` /
|
||||
`torch.distributed.all_reduce` (the unified `benches/ccl_allreduce.py`
|
||||
handles the common case).
|
||||
4. (Optional) Run the mock runtime for fast unit tests (a few ms).
|
||||
5. `kernbench run --bench <name> --verify-data` for full SimPy verification.
|
||||
|
||||
---
|
||||
|
||||
## 1. Hello World — the simplest send/recv
|
||||
|
||||
Each PE sends its tile to its E neighbor once and receives a tile from
|
||||
its W neighbor once. The reference code lives in
|
||||
[`src/kernbench/ccl/algorithms/hello_send.py`](../src/kernbench/ccl/algorithms/hello_send.py).
|
||||
|
||||
### Step 1: write the kernel
|
||||
|
||||
New file `src/kernbench/ccl/algorithms/hello_send.py`:
|
||||
|
||||
```python
|
||||
"""Hello world: send your tile to the next rank, receive from the previous one."""
|
||||
|
||||
|
||||
def kernel(t_ptr, n_elem, tl):
|
||||
# Global rank is computed from program_id(0/1) (ADR-0022).
|
||||
local_pe = tl.program_id(axis=0)
|
||||
cube_id = tl.program_id(axis=1)
|
||||
pes_per_cube = tl.num_programs(axis=0)
|
||||
rank = cube_id * pes_per_cube + local_pe
|
||||
|
||||
nbytes = n_elem * 2 # f16
|
||||
pe_addr = t_ptr + rank * nbytes
|
||||
|
||||
# Load our slice and send it east.
|
||||
src = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
|
||||
tl.send(dir="E", src=src)
|
||||
|
||||
# Receive from west and store directly back into our slice.
|
||||
recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
|
||||
tl.store(pe_addr, recv)
|
||||
|
||||
|
||||
def kernel_args(world_size: int, n_elem: int) -> tuple:
|
||||
"""Positional kernel args used by the ahbm backend (after t_ptr)."""
|
||||
return (n_elem,)
|
||||
```
|
||||
|
||||
Key points:
|
||||
|
||||
- **Global rank is computed from `program_id(axis=0)` + `program_id(axis=1)`.**
|
||||
TL has no contractually-supported `tl.rank` / `tl.world_size`. If the
|
||||
host needs to pass `world_size` or anything else as an algorithm
|
||||
parameter, it goes through ordinary `torch.launch` arguments.
|
||||
- **`tl.send` takes a `TensorHandle`.** PE_IPCQ reads
|
||||
`addr`/`space`/`shape`/`dtype`/`nbytes` from the handle to issue an
|
||||
`IpcqDmaToken` to PE_DMA.
|
||||
- **`tl.recv` requires `shape` and `dtype`.** The returned TensorHandle
|
||||
points at the IPCQ ring slot and can be used directly as a `dst`
|
||||
handle (e.g. `tl.store(pe_addr, recv)`). Phase 2's `dma_write` replay
|
||||
handles the (slot → hbm) copy, so user code never has to touch
|
||||
`recv.data`.
|
||||
|
||||
### Step 2: register in `ccl.yaml`
|
||||
|
||||
```yaml
|
||||
algorithms:
|
||||
hello_send:
|
||||
module: kernbench.ccl.algorithms.hello_send
|
||||
topology: ring_1d
|
||||
buffer_kind: tcm
|
||||
world_size: 8
|
||||
```
|
||||
|
||||
`world_size` here is optional. If absent, `AhbmCCLBackend` derives it
|
||||
from the topology spec (`sips × cubes_per_sip × pes_per_cube`).
|
||||
|
||||
### Step 3: write a host bench (optional — the unified bench may suffice)
|
||||
|
||||
For most CCL benchmarks the existing `benches/ccl_allreduce.py` is
|
||||
sufficient: it reads `ccl.yaml`, picks the algorithm, sets up the
|
||||
process group, and runs the collective. If your algorithm needs custom
|
||||
host logic, write a new bench file along the same lines.
|
||||
|
||||
The host code looks like a real PyTorch DDP worker:
|
||||
|
||||
```python
|
||||
"""benches/ccl_hello.py"""
|
||||
from __future__ import annotations
|
||||
|
||||
import numpy as np
|
||||
|
||||
from kernbench.policy.placement.dp import DPPolicy
|
||||
|
||||
|
||||
N_ELEM = 8
|
||||
|
||||
|
||||
def worker(rank: int, world_size: int, torch) -> None:
|
||||
"""Per-rank business logic — mirrors a real PyTorch DDP worker."""
|
||||
dp = DPPolicy(
|
||||
sip="replicate", cube="replicate", pe="column_wise",
|
||||
num_sips=1, num_cubes=1, num_pes=world_size,
|
||||
)
|
||||
tensor = torch.zeros(
|
||||
(1, world_size * N_ELEM), dtype="f16", dp=dp, name="hello_in",
|
||||
)
|
||||
|
||||
# Per-rank initialization via the real PyTorch idiom.
|
||||
init = np.zeros((1, world_size * N_ELEM), dtype=np.float16)
|
||||
for r in range(world_size):
|
||||
init[0, r * N_ELEM : (r + 1) * N_ELEM] = float(r + 1)
|
||||
tensor.copy_(torch.from_numpy(init))
|
||||
|
||||
# The collective itself.
|
||||
torch.distributed.all_reduce(tensor, op="sum")
|
||||
|
||||
# Verify on rank 0 (real PyTorch DDP idiom).
|
||||
if rank == 0:
|
||||
result = tensor.numpy()
|
||||
for r in range(world_size):
|
||||
expected = float(((r - 1) % world_size) + 1)
|
||||
slice_r = result[0, r * N_ELEM : (r + 1) * N_ELEM]
|
||||
print(
|
||||
f" rank {r}: got {float(slice_r.mean()):.1f}, "
|
||||
f"expected {expected:.1f}"
|
||||
)
|
||||
|
||||
|
||||
def run(torch) -> None:
|
||||
"""CLI entry point. Initializes dist, dispatches to worker."""
|
||||
dist = torch.distributed
|
||||
dist.init_process_group(backend="ahbm")
|
||||
worker(
|
||||
rank=dist.get_rank(),
|
||||
world_size=dist.get_world_size(),
|
||||
torch=torch,
|
||||
)
|
||||
```
|
||||
|
||||
### Step 4: unit test (optional but strongly recommended)
|
||||
|
||||
`tests/test_hello_send.py`:
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
|
||||
from kernbench.ccl.algorithms.hello_send import kernel
|
||||
from kernbench.ccl.testing import run_kernel_in_mock
|
||||
|
||||
|
||||
def test_hello_send_4_ranks():
|
||||
n_elem = 8
|
||||
inputs = [
|
||||
np.full((n_elem,), float(r + 1), dtype=np.float16)
|
||||
for r in range(4)
|
||||
]
|
||||
outputs = run_kernel_in_mock(
|
||||
kernel_fn=kernel,
|
||||
world_size=4,
|
||||
topology="ring_1d",
|
||||
inputs=inputs,
|
||||
kernel_args=(n_elem,),
|
||||
)
|
||||
# rank r should now hold rank (r-1) % 4's data.
|
||||
for r in range(4):
|
||||
assert np.array_equal(outputs[r], inputs[(r - 1) % 4])
|
||||
```
|
||||
|
||||
`run_kernel_in_mock` runs every rank concurrently in pure Python (no
|
||||
SimPy), so a unit test like this finishes in **milliseconds**. It only
|
||||
verifies algorithmic correctness — no latency, no DMA, no fabric.
|
||||
|
||||
### Step 5: SimPy validation
|
||||
|
||||
```bash
|
||||
kernbench run --topology topology.yaml --bench ccl_hello --verify-data
|
||||
```
|
||||
|
||||
Phase 1 runs the SimPy simulation + MemoryStore data movement, Phase 2
|
||||
replays the op_log for correctness. The bench's `print` lines should
|
||||
show OK for every rank.
|
||||
|
||||
---
|
||||
|
||||
## 2. Ring all-reduce — the second algorithm
|
||||
|
||||
Slightly more complex. Each PE runs `world_size - 1` rounds, sending
|
||||
its current tile east and accumulating the tile received from the west.
|
||||
After all rounds, every PE holds the global sum.
|
||||
|
||||
The reference implementation lives in
|
||||
[`src/kernbench/ccl/algorithms/ring_allreduce.py`](../src/kernbench/ccl/algorithms/ring_allreduce.py).
|
||||
The core flow:
|
||||
|
||||
```python
|
||||
"""Ring all-reduce."""
|
||||
|
||||
|
||||
def kernel(t_ptr, n_elem, world_size, tl):
|
||||
local_pe = tl.program_id(axis=0)
|
||||
cube_id = tl.program_id(axis=1)
|
||||
pes_per_cube = tl.num_programs(axis=0)
|
||||
rank = cube_id * pes_per_cube + local_pe
|
||||
nbytes = n_elem * 2
|
||||
pe_addr = t_ptr + rank * nbytes
|
||||
|
||||
# The handle points at HBM[pe_addr]. In greenlet mode .data is
|
||||
# populated, but the kernel never has to touch .data directly.
|
||||
acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
|
||||
current = acc # source for the first send
|
||||
|
||||
for _step in range(world_size - 1):
|
||||
tl.send(dir="E", src=current)
|
||||
recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
|
||||
# TensorHandle operator overload → MathCmd → PE_MATH dispatch.
|
||||
# Phase 1 only models timing; Phase 2 DataExecutor replays the
|
||||
# actual numpy accumulation.
|
||||
acc = acc + recv
|
||||
current = recv # forward the received slot to the next round
|
||||
|
||||
# Store the final accumulator back to HBM. Source is acc (a PE-local
|
||||
# scratch addr); dst is HBM. The op_log dma_write entry records both
|
||||
# ends so Phase 2 copies the math result into HBM at verify time.
|
||||
tl.store(pe_addr, acc)
|
||||
|
||||
|
||||
def kernel_args(world_size: int, n_elem: int) -> tuple:
|
||||
return (n_elem, world_size)
|
||||
```
|
||||
|
||||
Four key points:
|
||||
|
||||
1. **Accumulation goes through TensorHandle operators.** `acc + recv`
|
||||
emits a `MathCmd` and dispatches it through PE_MATH — i.e. the
|
||||
real hardware path, so the latency model stays accurate. Per
|
||||
ADR-0020 D3, Phase 1 only simulates timing; Phase 2's `DataExecutor`
|
||||
replays the op_log and runs the actual numpy accumulation.
|
||||
2. **Use `current = recv` to forward.** Each round must update the send
|
||||
source to the just-received slot handle so the same data circulates
|
||||
exactly once around the ring. Setting `current = acc` would resend
|
||||
the cumulative sum, inflating the result.
|
||||
3. **`tl.store(pe_addr, acc)` exactly once at the end.** Do not use a
|
||||
store→reload pattern in the middle. `acc` lives in PE-local scratch;
|
||||
the op_log records `(src=scratch, dst=hbm)` and Phase 2 first runs
|
||||
math (filling scratch) then copies via the dma_write snapshot.
|
||||
4. **`world_size` is passed by the host explicitly.** TL only knows the
|
||||
topology slot count (e.g. `num_programs(axis=0)` is "PEs per cube"),
|
||||
not the participating CCL group size. The host bench knows
|
||||
`world_size` and forwards it as an explicit kernel argument.
|
||||
|
||||
For registration in `ccl.yaml` and wiring through the unified bench,
|
||||
look at the existing `ring_allreduce_tcm/_hbm/_sram` entries plus
|
||||
[`benches/ccl_allreduce.py`](../benches/ccl_allreduce.py). Mock unit
|
||||
tests live in
|
||||
[`tests/test_ccl_mock_runtime.py`](../tests/test_ccl_mock_runtime.py)
|
||||
and follow the `kernel_args=(n_elem, world_size)` convention.
|
||||
|
||||
---
|
||||
|
||||
## 3. `neighbors()` override — custom topology
|
||||
|
||||
Most algorithms are happy with the builtin topologies (`ring_1d`,
|
||||
`mesh_2d`, `tree_binary`, `ring_1d_unidir`, `none`). If you want to
|
||||
modify a builtin or define a brand-new connectivity pattern, define a
|
||||
`neighbors()` function in your algorithm module.
|
||||
|
||||
### Signature
|
||||
|
||||
```python
|
||||
def neighbors(
|
||||
rank: int, world_size: int, neighbor_map: dict[str, int],
|
||||
) -> dict[str, int] | None:
|
||||
"""Override the neighbor map produced by the builtin topology.
|
||||
|
||||
Args:
|
||||
neighbor_map: the mapping the ccl.yaml ``topology`` field built.
|
||||
For ring_1d this is {"E": (rank+1)%ws, "W": (rank-1)%ws}.
|
||||
The dict is mutable — modify in place if you want.
|
||||
|
||||
Returns:
|
||||
dict: the new neighbor map (or the modified-in-place dict).
|
||||
None: do not override; use neighbor_map as-is.
|
||||
"""
|
||||
return None
|
||||
```
|
||||
|
||||
### Pattern A: tweak a builtin
|
||||
|
||||
```python
|
||||
def neighbors(rank, world_size, neighbor_map):
|
||||
# Only even ranks use W; remove W from odd ranks.
|
||||
if rank % 2 == 1:
|
||||
neighbor_map.pop("W", None)
|
||||
return neighbor_map
|
||||
```
|
||||
|
||||
### Pattern B: replace entirely (skip-connection ring)
|
||||
|
||||
```python
|
||||
def neighbors(rank, world_size, neighbor_map):
|
||||
return {"E": (rank + 2) % world_size}
|
||||
```
|
||||
|
||||
### Pattern C: keep builtin
|
||||
|
||||
Either omit `neighbors` entirely or return None:
|
||||
|
||||
```python
|
||||
def neighbors(rank, world_size, neighbor_map):
|
||||
return None # explicit "use the builtin"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. PE kernel API reference (ADR-0023 D4)
|
||||
|
||||
### IPCQ API
|
||||
|
||||
| API | Description | Blocking? |
|
||||
|-----|-------------|-----------|
|
||||
| `tl.send(dir, src=TensorHandle)` | Send to a peer in the given direction. | Yes (waits if peer slots are full) |
|
||||
| `tl.send(dir, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)` | Same, keyword form. | Yes |
|
||||
| `tl.recv(dir, shape=..., dtype=...)` | Blocking recv from one direction. | Yes |
|
||||
| `tl.recv(shape=..., dtype=...)` | Round-robin recv across all four directions. | Yes |
|
||||
| `tl.recv_async(dir, shape=..., dtype=...) → RecvFuture` | Non-blocking recv. | No |
|
||||
| `tl.wait(future)` | Wait for a non-blocking recv future → returns the resolved TensorHandle. | Yes |
|
||||
|
||||
### Existing TL API (ADR-0020/0022, unchanged)
|
||||
|
||||
| API | Description |
|
||||
|-----|-------------|
|
||||
| `tl.load(addr, shape, dtype) → TensorHandle` | DMA read; in greenlet mode `.data` carries the ndarray. |
|
||||
| `tl.store(addr, handle)` | DMA write — when `handle.data` is set the runner propagates it to MemoryStore. |
|
||||
| `tl.composite(op, ...)` | Submit a GEMM/Math composite (non-blocking). |
|
||||
| `tl.program_id(axis=0)` | Local PE id within the cube. |
|
||||
| `tl.program_id(axis=1)` | Cube id (ADR-0022). |
|
||||
| `tl.num_programs(axis=0/1)` | Topology slot counts (NOT the participating-rank count). |
|
||||
|
||||
### Two recv modes
|
||||
|
||||
The default is `return_slot` (zero-copy): the IPCQ slot address is
|
||||
returned in `handle.addr`. To force a copy into a custom destination,
|
||||
pass `dst_addr` + `dst_space`:
|
||||
|
||||
```python
|
||||
recv = tl.recv(
|
||||
dir="W", shape=(8,), dtype="f16",
|
||||
dst_addr=my_scratch_addr,
|
||||
dst_space="hbm",
|
||||
)
|
||||
# After this call recv.addr == my_scratch_addr (copy_to_dst mode).
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5. Helpers (`kernbench.ccl.helpers`)
|
||||
|
||||
Convenience helpers to keep algorithm code short:
|
||||
|
||||
```python
|
||||
from kernbench.ccl.helpers import chunked, ring_step, tree_step
|
||||
```
|
||||
|
||||
### `chunked(base_addr, n_chunks, n_elem, dtype="f16") → list[Chunk]`
|
||||
|
||||
Split a tile of `n_elem` elements into `n_chunks` equal-size views.
|
||||
Each `Chunk` has `addr`, `n_elem`, `nbytes` fields.
|
||||
|
||||
```python
|
||||
chunks = chunked(t_ptr, n_chunks=4, n_elem=64, dtype="f16")
|
||||
# chunks[0..3] are 16-element views with consecutive addresses.
|
||||
```
|
||||
|
||||
### `ring_step(rank, step, world_size) → (send_idx, recv_idx)`
|
||||
|
||||
Per-step chunk indices for a ring algorithm (reduce-scatter / all-gather):
|
||||
|
||||
```python
|
||||
for step in range(world_size - 1):
|
||||
send_idx, recv_idx = ring_step(rank, step, world_size)
|
||||
tl.send(
|
||||
dir="E", src_addr=chunks[send_idx].addr,
|
||||
nbytes=chunks[send_idx].nbytes,
|
||||
shape=(chunks[send_idx].n_elem,), dtype="f16",
|
||||
)
|
||||
recv = tl.recv(
|
||||
dir="W", shape=(chunks[recv_idx].n_elem,), dtype="f16",
|
||||
)
|
||||
# accumulate ...
|
||||
```
|
||||
|
||||
### `tree_step(rank, world_size) → {"parent": int|None, "children": list[int]}`
|
||||
|
||||
Parent / children rank ids for a binary tree:
|
||||
|
||||
```python
|
||||
info = tree_step(rank, world_size)
|
||||
if info["parent"] is None:
|
||||
print(f"rank {rank} is the root")
|
||||
for child in info["children"]:
|
||||
...
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6. Unit testing — Mock runtime
|
||||
|
||||
`kernbench.ccl.testing.run_kernel_in_mock` runs an algorithm without
|
||||
SimPy for fast feedback.
|
||||
|
||||
### Basic usage
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
|
||||
from kernbench.ccl.testing import run_kernel_in_mock
|
||||
from kernbench.ccl.algorithms.my_algo import kernel
|
||||
|
||||
|
||||
def test_my_algo():
|
||||
n_elem = 16
|
||||
inputs = [np.arange(n_elem, dtype="f16") + r for r in range(4)]
|
||||
expected = sum(inputs)
|
||||
outputs = run_kernel_in_mock(
|
||||
kernel_fn=kernel,
|
||||
world_size=4,
|
||||
topology="ring_1d",
|
||||
inputs=inputs,
|
||||
kernel_args=(n_elem, 4), # positional args after t_ptr
|
||||
)
|
||||
for r in range(4):
|
||||
assert np.allclose(outputs[r], expected, rtol=1e-3)
|
||||
```
|
||||
|
||||
### Behavior
|
||||
|
||||
- All ranks run their kernels concurrently as cooperative greenlets.
|
||||
- `tl.send` / `tl.recv` are serviced by in-memory FIFOs (no DMA, no
|
||||
latency).
|
||||
- Each rank's last `store` is what the helper returns as a numpy array.
|
||||
|
||||
### Limitations
|
||||
|
||||
- No latency or performance numbers (it is not a simulation).
|
||||
- No PE_DMA, fabric, or BW model.
|
||||
- Correctness only.
|
||||
- One cube assumed: `program_id(axis=1)` is always 0.
|
||||
|
||||
---
|
||||
|
||||
## 7. Debugging
|
||||
|
||||
### CCL trace
|
||||
|
||||
```bash
|
||||
KERNBENCH_CCL_TRACE=1 kernbench run --topology topology.yaml \
|
||||
--bench ccl_allreduce --verify-data
|
||||
```
|
||||
|
||||
Per-rank send/recv events appear on stdout:
|
||||
|
||||
```
|
||||
[ccl t=346.4 send] sip0.cube0.pe1 dir=E nbytes=64 seq=0
|
||||
[ccl t=360.4 recv] sip0.cube0.pe2 dir=W nbytes=64
|
||||
```
|
||||
|
||||
### Pointer dump
|
||||
|
||||
`kernbench.ccl.diagnostics.pointer_dump(engine)` returns a multi-line
|
||||
dump of every PE_IPCQ ring buffer's `my_head`, `my_tail`,
|
||||
`peer_head_cache`, `peer_tail_cache`. When something hangs, this shows
|
||||
which rank is stuck and on what.
|
||||
|
||||
### Deadlock detection
|
||||
|
||||
When the SimPy schedule empties because of unmatched send/recv pairs,
|
||||
the engine raises `IpcqDeadlock` and embeds the pointer dump in the
|
||||
message (ADR-0023 D14 F3). Wait-for-graph visualization is future
|
||||
work.
|
||||
|
||||
---
|
||||
|
||||
## 8. Common mistakes
|
||||
|
||||
### 1. Using a direction that wasn't installed
|
||||
|
||||
`topology: ring_1d` only installs E and W. Trying:
|
||||
|
||||
```python
|
||||
tl.send(dir="N", ...) # → IpcqInvalidDirection
|
||||
```
|
||||
|
||||
Fix: switch to `topology: mesh_2d`, or add N/S in a `neighbors()` override.
|
||||
|
||||
### 2. `send` without a matching `recv`
|
||||
|
||||
```python
|
||||
def kernel(..., tl):
|
||||
for _ in range(100):
|
||||
tl.send(dir="E", ...)
|
||||
# The peer never recvs → ring buffer fills → backpressure → deadlock.
|
||||
```
|
||||
|
||||
Fix: every `send` needs a matching `recv` on the receiver side.
|
||||
Otherwise `IpcqDeadlock` is raised.
|
||||
|
||||
### 3. dtype/shape mismatch
|
||||
|
||||
By default mismatches are not validated. The author is responsible for
|
||||
consistency. Set `strict_validation: true` on a PE_IPCQ node's attrs to
|
||||
enable D14 F2 strict mode and catch them immediately.
|
||||
|
||||
### 4. Assuming round-robin recv fairness
|
||||
|
||||
`tl.recv()` (no direction) returns the first slot to arrive in
|
||||
round-robin order, but **arrival order is not predictable**. If your
|
||||
algorithm depends on a particular direction, name it explicitly:
|
||||
`tl.recv(dir="N", ...)`.
|
||||
|
||||
### 5. Confusing `num_programs` with the CCL group size
|
||||
|
||||
`tl.num_programs(axis=0/1)` reports topology slot counts, not the
|
||||
number of ranks participating in the collective. The host bench knows
|
||||
`world_size` and must pass it through as a kernel argument.
|
||||
|
||||
### 6. Overwriting the send source before it's actually sent
|
||||
|
||||
PE_DMA snapshots the source data into the IpcqDmaToken at send time,
|
||||
preserving in-flight semantics. Even so, the safest pattern is to call
|
||||
`tl.send` first and only mutate the source addr afterwards. If you
|
||||
mutate the addr before `tl.send` makes it into the PE_DMA queue, the
|
||||
snapshot will pick up the wrong data.
|
||||
|
||||
---
|
||||
|
||||
## 9. Next steps
|
||||
|
||||
- Try other topologies (`mesh_2d`, `tree_binary`).
|
||||
- Faster algorithms (recursive halving / doubling).
|
||||
- Compare `buffer_kind` (tcm/hbm/sram) and `backpressure` (poll/sleep)
|
||||
modes for latency.
|
||||
- Larger-scale validation through the unified `ccl_allreduce` bench
|
||||
with different `ccl.yaml` overlays.
|
||||
|
||||
If you add a new algorithm or pattern, please send a PR.
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md): IPCQ + PE-level collective design.
|
||||
- [ADR-0022](adr/ADR-0022-program-id-2d-grid.md): 2D grid program_id (axis=0/1).
|
||||
- [ADR-0020](adr/ADR-0020-data-execution-two-pass.md): 2-pass data execution.
|
||||
- [ADR-0021](adr/ADR-0021-pe-pipeline-refactor.md): PE pipeline refactor.
|
||||
|
||||
Existing algorithm examples:
|
||||
|
||||
- [`src/kernbench/ccl/algorithms/hello_send.py`](../src/kernbench/ccl/algorithms/hello_send.py) — simplest send/recv
|
||||
- [`src/kernbench/ccl/algorithms/ring_allreduce.py`](../src/kernbench/ccl/algorithms/ring_allreduce.py) — ring all-reduce
|
||||
- [`src/kernbench/ccl/algorithms/mesh_allreduce.py`](../src/kernbench/ccl/algorithms/mesh_allreduce.py) — 2D mesh all-reduce
|
||||
- [`src/kernbench/ccl/algorithms/tree_allreduce.py`](../src/kernbench/ccl/algorithms/tree_allreduce.py) — binary tree all-reduce
|
||||
@@ -0,0 +1,537 @@
|
||||
# CCL Algorithm Author Guide
|
||||
|
||||
이 문서는 kernbench에서 CCL (Collective Communication Library) 알고리즘을
|
||||
직접 작성하는 사람을 위한 step-by-step 가이드이다. 시스템 내부 설계와
|
||||
컴포넌트 구조는 [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md)에 있다.
|
||||
|
||||
본 가이드는 알고리즘 작성자가 **자신이 만져야 할 곳**과 **만지지 않아도 될 곳**을
|
||||
명확히 분리하고, 가장 짧은 경로로 첫 알고리즘을 동작시키는 것을 목표로 한다.
|
||||
|
||||
---
|
||||
|
||||
## 0. 5분 요약
|
||||
|
||||
| 만지는 것 | 위치 |
|
||||
|----------|------|
|
||||
| 알고리즘 모듈 (kernel + 선택적 neighbors) | `src/kernbench/ccl/algorithms/<algo>.py` |
|
||||
| 알고리즘 등록 | `ccl.yaml` |
|
||||
| 호스트 bench (PE 수, 메모리 init, launch, 검증) | `benches/<your_bench>.py` |
|
||||
| (선택) 단위 테스트 | `tests/test_<algo>.py` |
|
||||
|
||||
| 만지지 않는 것 | 위치 |
|
||||
|---------------|------|
|
||||
| TLContext API | `src/kernbench/triton_emu/tl_context.py` (ADR-0022 spec) |
|
||||
| 프레임워크 (topology generators, helpers, mock testing) | `src/kernbench/ccl/` |
|
||||
| PE_IPCQ / PE_DMA 컴포넌트 | `src/kernbench/components/builtin/` |
|
||||
| backend 구현 (install_ipcq) | `src/kernbench/runtime_api/distributed.py` 및 `kernbench/ccl/install.py` |
|
||||
|
||||
흐름:
|
||||
1. 알고리즘 모듈에 `kernel` 작성
|
||||
2. `ccl.yaml`에 entry 등록
|
||||
3. 호스트 bench에서 `install_ipcq` + `launch`
|
||||
4. (선택) mock runtime으로 단위 테스트 (수 ms)
|
||||
5. `kernbench run --bench <name> --verify-data`로 SimPy 검증
|
||||
|
||||
---
|
||||
|
||||
## 1. Hello World — 가장 단순한 send/recv
|
||||
|
||||
각 PE가 자기 데이터를 E 방향 이웃에 한 번 보내고, W 방향에서 한 번 받는
|
||||
가장 단순한 알고리즘이다. 실제 동작 코드는
|
||||
[`src/kernbench/ccl/algorithms/hello_send.py`](../src/kernbench/ccl/algorithms/hello_send.py)
|
||||
에 있다.
|
||||
|
||||
### Step 1: kernel 작성
|
||||
|
||||
새 파일 `src/kernbench/ccl/algorithms/hello_send.py`:
|
||||
|
||||
```python
|
||||
"""Hello world: 자기 데이터를 다음 rank에 보내고 이전 rank에서 받기."""
|
||||
def kernel(t_ptr, n_elem, tl):
|
||||
# 글로벌 rank는 program_id(0/1)에서 계산 (ADR-0022)
|
||||
local_pe = tl.program_id(axis=0)
|
||||
cube_id = tl.program_id(axis=1)
|
||||
pes_per_cube = tl.num_programs(axis=0)
|
||||
rank = cube_id * pes_per_cube + local_pe
|
||||
|
||||
nbytes = n_elem * 2 # f16
|
||||
pe_addr = t_ptr + rank * nbytes
|
||||
|
||||
# 자기 슬라이스를 로드해서 E로 보낸다.
|
||||
src = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
|
||||
tl.send(dir="E", src=src)
|
||||
|
||||
# W 방향에서 받아서 그대로 자기 슬라이스에 store한다.
|
||||
recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
|
||||
tl.store(pe_addr, recv)
|
||||
```
|
||||
|
||||
핵심 포인트:
|
||||
|
||||
- **글로벌 rank는 `program_id(axis=0)` + `program_id(axis=1)`에서 계산.** TL에는
|
||||
`tl.rank` / `tl.world_size` 같은 약속되지 않은 확장이 없다. 호스트가
|
||||
`world_size` 같은 알고리즘 파라미터가 필요하면 `torch.launch`의 일반 인자로
|
||||
전달한다.
|
||||
- **`tl.send`는 `TensorHandle`을 받는다.** 핸들의 `addr`/`space`/`shape`/`dtype`/`nbytes`를
|
||||
PE_IPCQ가 읽어 PE_DMA에 IpcqDmaToken을 발행한다.
|
||||
- **`tl.recv`는 `shape`와 `dtype`이 필수.** 반환된 TensorHandle은 IPCQ ring slot을
|
||||
가리키며, `tl.store(pe_addr, recv)`처럼 dst 핸들로 그대로 사용할 수 있다.
|
||||
Phase 2 dma_write replay가 (slot, hbm) 복사를 수행하므로 numpy `.data`를
|
||||
직접 만질 필요가 없다.
|
||||
|
||||
### Step 2: ccl.yaml 등록
|
||||
|
||||
`ccl.yaml`의 `algorithms` 섹션에 entry를 추가한다. (defaults.algorithm은 호스트
|
||||
bench가 `install_ipcq(algorithm=...)`로 명시 전달해도 되므로 꼭 바꿀 필요는 없다.)
|
||||
|
||||
```yaml
|
||||
algorithms:
|
||||
hello_send:
|
||||
module: kernbench.ccl.algorithms.hello_send
|
||||
topology: ring_1d
|
||||
buffer_kind: tcm
|
||||
```
|
||||
|
||||
### Step 3: 호스트 bench 작성
|
||||
|
||||
새 파일 `benches/ccl_hello.py`:
|
||||
|
||||
```python
|
||||
"""Hello-world ring rotation bench (각 PE가 W 이웃의 데이터를 1번 받음)."""
|
||||
import numpy as np
|
||||
|
||||
from kernbench.ccl.algorithms import hello_send
|
||||
from kernbench.policy.placement.dp import DPPolicy
|
||||
|
||||
ALGORITHM = "hello_send"
|
||||
N_ELEM = 8
|
||||
WORLD_SIZE = 8
|
||||
|
||||
|
||||
def run(torch):
|
||||
plan = torch.install_ipcq(algorithm=ALGORITHM)
|
||||
|
||||
a = torch.zeros(
|
||||
(1, WORLD_SIZE * N_ELEM), dtype="f16",
|
||||
dp=DPPolicy(
|
||||
sip="replicate", cube="replicate", pe="column_wise",
|
||||
num_sips=1, num_cubes=1,
|
||||
),
|
||||
name="hello_in",
|
||||
)
|
||||
|
||||
store = torch.engine.memory_store
|
||||
base = a._handle.va_base or a._handle.shards[0].pa
|
||||
nbytes = N_ELEM * 2
|
||||
for r in range(WORLD_SIZE):
|
||||
store.write("hbm", base + r * nbytes,
|
||||
np.full((N_ELEM,), float(r + 1), dtype=np.float16))
|
||||
|
||||
torch.launch(ALGORITHM, hello_send.kernel, a, N_ELEM)
|
||||
|
||||
# rank r은 rank (r-1)%ws의 데이터를 가져야 한다.
|
||||
for r, (sip, cube, pe) in enumerate(plan["rank_to_pe"]):
|
||||
result = store.read("hbm", base + r * nbytes, shape=(N_ELEM,), dtype="f16")
|
||||
prev = float(((r - 1) % WORLD_SIZE) + 1)
|
||||
ok = np.allclose(result, prev)
|
||||
print(f" [{'OK ' if ok else 'FAIL'}] rank {r} got {float(result.mean()):.1f}, "
|
||||
f"expected {prev:.1f}")
|
||||
```
|
||||
|
||||
### Step 4: 단위 테스트 (선택, 강력 추천)
|
||||
|
||||
`tests/test_hello_send.py`:
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
from kernbench.ccl.algorithms.hello_send import kernel
|
||||
from kernbench.ccl.testing import run_kernel_in_mock
|
||||
|
||||
|
||||
def test_hello_send_4_ranks():
|
||||
n_elem = 8
|
||||
inputs = [np.full((n_elem,), float(r + 1), dtype=np.float16) for r in range(4)]
|
||||
|
||||
outputs = run_kernel_in_mock(
|
||||
kernel_fn=kernel,
|
||||
world_size=4,
|
||||
topology="ring_1d",
|
||||
inputs=inputs,
|
||||
kernel_args=(n_elem,),
|
||||
)
|
||||
|
||||
# rank r은 rank (r-1) % 4의 데이터를 받아야 함
|
||||
for r in range(4):
|
||||
assert np.array_equal(outputs[r], inputs[(r - 1) % 4])
|
||||
```
|
||||
|
||||
`run_kernel_in_mock`는 SimPy 없이 순수 Python으로 모든 rank를 동시 실행하므로
|
||||
**ms 단위로 끝난다**. 알고리즘 logic 정합성만 검증.
|
||||
|
||||
### Step 5: 시뮬 검증
|
||||
|
||||
```bash
|
||||
kernbench run --topology topology.yaml --bench ccl_hello --verify-data
|
||||
```
|
||||
|
||||
Phase 1에서 SimPy 시뮬레이션 + MemoryStore 데이터 이동, Phase 2에서 op_log
|
||||
정합성 replay. 호스트 bench의 `print` 검증이 모든 rank에 대해 OK여야 한다.
|
||||
|
||||
---
|
||||
|
||||
## 2. Ring All-Reduce — 두 번째 알고리즘
|
||||
|
||||
조금 더 복잡한 예제. Ring all-reduce는 N-1 라운드 동안 각 PE가 자기 데이터를
|
||||
E로 보내고 W에서 받아 누적한다. 최종적으로 모든 PE가 글로벌 sum을 갖는다.
|
||||
|
||||
실제 동작 코드는 [`src/kernbench/ccl/algorithms/ring_allreduce.py`](../src/kernbench/ccl/algorithms/ring_allreduce.py)
|
||||
참조. 핵심 흐름:
|
||||
|
||||
```python
|
||||
"""Ring all-reduce."""
|
||||
|
||||
|
||||
def kernel(t_ptr, n_elem, world_size, tl):
|
||||
# rank
|
||||
local_pe = tl.program_id(axis=0)
|
||||
cube_id = tl.program_id(axis=1)
|
||||
pes_per_cube = tl.num_programs(axis=0)
|
||||
rank = cube_id * pes_per_cube + local_pe
|
||||
nbytes = n_elem * 2
|
||||
pe_addr = t_ptr + rank * nbytes
|
||||
|
||||
# HBM의 자기 슬라이스를 가리키는 TensorHandle. greenlet 모드에선 .data가
|
||||
# 채워지지만 커널은 .data를 직접 만질 필요가 없다.
|
||||
acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
|
||||
current = acc # 첫 라운드 send 출처
|
||||
|
||||
for _step in range(world_size - 1):
|
||||
tl.send(dir="E", src=current)
|
||||
recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
|
||||
# TensorHandle 연산자 오버로드 → MathCmd → PE_MATH 디스패치.
|
||||
# Phase 1은 타이밍만, Phase 2 DataExecutor가 실제 numpy 누적을 수행한다.
|
||||
acc = acc + recv
|
||||
current = recv # 다음 라운드는 직전에 받은 슬롯을 다시 forward
|
||||
|
||||
# 최종 누적값을 자기 슬라이스에 store. 출처는 acc(=PE-local scratch addr)
|
||||
# 이고 dst는 HBM. op_log dma_write가 (scratch, hbm) 복사 정보를 기록하므로
|
||||
# Phase 2가 검증 시점에 HBM[pe_addr]에 정답을 채워준다.
|
||||
tl.store(pe_addr, acc)
|
||||
```
|
||||
|
||||
네 가지 포인트:
|
||||
|
||||
1. **누적은 TensorHandle 연산자**: `acc + recv`는 `MathCmd`를 emit하고
|
||||
PE_MATH로 디스패치된다 — 실제 하드웨어 경로를 거치므로 latency 모델이
|
||||
정확하다. ADR-0020 D3대로 Phase 1은 타이밍만 시뮬레이션하고, Phase 2
|
||||
`DataExecutor`가 op_log를 재실행하면서 numpy 누적을 수행한다.
|
||||
2. **`current = recv`로 forward**: 매 라운드의 send 출처를 직전에 받은 슬롯
|
||||
핸들로 갱신해야 같은 데이터가 ring을 순회하면서 누적이 한 번씩 일어난다.
|
||||
`current = acc`로 두면 누적값이 다시 송출되어 결과가 부풀려진다.
|
||||
3. **`tl.store(pe_addr, acc)` 한 번이면 끝**: 중간에 store→reload 패턴은
|
||||
금지다. acc는 PE-local scratch에 살고, op_log가 (src=scratch, dst=hbm)
|
||||
메타데이터를 기록한다. Phase 2가 math를 먼저 실행해 scratch를 채운 뒤
|
||||
dma_write 스냅샷으로 HBM에 복사한다.
|
||||
4. **`world_size`는 호스트가 명시 전달**: TL은 topology slot 수만 안다 (예:
|
||||
`num_programs(axis=0)`은 cube당 PE 수). 실제 참여하는 CCL group 크기는 bench가
|
||||
알고 호스트→kernel 인자로 넘긴다.
|
||||
|
||||
`ccl.yaml` 등록 + 호스트 bench는 [`benches/ccl_allreduce_tcm.py`](../benches/ccl_allreduce_tcm.py)
|
||||
참조. mock 단위 테스트는 [`tests/test_ccl_mock_runtime.py`](../tests/test_ccl_mock_runtime.py)
|
||||
를 그대로 따라하면 된다 (`kernel_args=(n_elem, world_size)` 인자 형태).
|
||||
|
||||
---
|
||||
|
||||
## 3. neighbors() override — Custom topology
|
||||
|
||||
대부분의 알고리즘은 builtin topology(`ring_1d`, `mesh_2d`, `tree_binary`,
|
||||
`ring_1d_unidir`, `none`)로 충분하다. builtin을 변형하거나 새로 만들고 싶으면
|
||||
알고리즘 모듈에 `neighbors()`를 정의한다.
|
||||
|
||||
### 시그니처
|
||||
|
||||
```python
|
||||
def neighbors(rank: int, world_size: int, neighbor_map: dict[str, int]) -> dict[str, int] | None:
|
||||
"""builtin topology가 만든 neighbor_map을 override.
|
||||
|
||||
Args:
|
||||
neighbor_map: ccl.yaml의 topology 필드가 만든 builtin 매핑.
|
||||
예: ring_1d → {"E": (rank+1)%ws, "W": (rank-1)%ws}
|
||||
mutable dict — 직접 수정 가능.
|
||||
|
||||
Returns:
|
||||
dict: neighbor_map을 override한 결과 (또는 수정한 그 dict)
|
||||
None: override 안 함, neighbor_map 그대로 사용
|
||||
"""
|
||||
return None
|
||||
```
|
||||
|
||||
### Pattern A: builtin을 base로 일부만 수정
|
||||
|
||||
```python
|
||||
def neighbors(rank, world_size, neighbor_map):
|
||||
# 짝수 rank만 W 방향 사용 (홀수 rank는 W 제거)
|
||||
if rank % 2 == 1:
|
||||
neighbor_map.pop("W", None)
|
||||
return neighbor_map
|
||||
```
|
||||
|
||||
### Pattern B: 완전히 새로 작성 (skip-connection ring)
|
||||
|
||||
```python
|
||||
def neighbors(rank, world_size, neighbor_map):
|
||||
# neighbor_map은 무시하고 새로 작성
|
||||
return {"E": (rank + 2) % world_size}
|
||||
```
|
||||
|
||||
### Pattern C: builtin 사용, override 없음
|
||||
|
||||
`neighbors()` 함수를 정의하지 않거나 None을 반환:
|
||||
|
||||
```python
|
||||
def neighbors(rank, world_size, neighbor_map):
|
||||
return None # 명시적으로 builtin 사용
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. PE 커널 API 레퍼런스 (ADR-0023 D4)
|
||||
|
||||
### IPCQ API
|
||||
|
||||
| API | 설명 | Blocking? |
|
||||
|-----|------|-----------|
|
||||
| `tl.send(dir, src=TensorHandle)` | direction으로 데이터 send | Yes (peer slot full 시 wait) |
|
||||
| `tl.send(dir, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)` | 동일, keyword 형태 | Yes |
|
||||
| `tl.recv(dir, shape=..., dtype=...)` | 특정 방향에서 blocking recv | Yes |
|
||||
| `tl.recv(shape=..., dtype=...)` | 4방향 round-robin recv (방향 미지정) | Yes |
|
||||
| `tl.recv_async(dir, shape=..., dtype=...) → RecvFuture` | non-blocking recv | No |
|
||||
| `tl.wait(future)` | non-blocking future 완료 대기 → TensorHandle | Yes |
|
||||
|
||||
### 기존 TL API (ADR-0020/0022, 그대로 사용 가능)
|
||||
|
||||
| API | 설명 |
|
||||
|-----|------|
|
||||
| `tl.load(addr, shape, dtype) → TensorHandle` | DMA read; greenlet 모드에서 `.data`에 ndarray |
|
||||
| `tl.store(addr, handle)` | DMA write — handle.data가 있으면 MemoryStore에 propagate |
|
||||
| `tl.composite(op, ...)` | GEMM/Math compute 비동기 submit |
|
||||
| `tl.program_id(axis=0)` | cube 내 local PE id |
|
||||
| `tl.program_id(axis=1)` | cube id (ADR-0022) |
|
||||
| `tl.num_programs(axis=0/1)` | topology 슬롯 수 (참여 ranks 수가 아님) |
|
||||
|
||||
### `recv` 두 가지 모드
|
||||
|
||||
기본은 `return_slot` (zero-copy): IPCQ slot 주소가 그대로 handle.addr에 들어온다.
|
||||
slot 데이터를 별도 위치로 복사하고 싶으면 `dst_addr` + `dst_space`를 명시:
|
||||
|
||||
```python
|
||||
recv = tl.recv(
|
||||
dir="W", shape=(8,), dtype="f16",
|
||||
dst_addr=my_scratch_addr,
|
||||
dst_space="hbm",
|
||||
)
|
||||
# 이제 recv.addr == my_scratch_addr (copy_to_dst 모드)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5. Helpers (`kernbench.ccl.helpers`)
|
||||
|
||||
알고리즘 코드를 짧게 유지하기 위한 헬퍼들:
|
||||
|
||||
```python
|
||||
from kernbench.ccl.helpers import chunked, ring_step, tree_step
|
||||
```
|
||||
|
||||
### `chunked(base_addr, n_chunks, n_elem, dtype="f16") → list[Chunk]`
|
||||
|
||||
총 `n_elem` 개의 element를 `n_chunks` 등분한 view 리스트를 반환. 각 `Chunk`는
|
||||
`addr`, `n_elem`, `nbytes` 필드를 가진다.
|
||||
|
||||
```python
|
||||
chunks = chunked(t_ptr, n_chunks=4, n_elem=64, dtype="f16")
|
||||
# chunks[0..3] 각각 16 element view, addr이 연속
|
||||
```
|
||||
|
||||
### `ring_step(rank, step, world_size) → (send_idx, recv_idx)`
|
||||
|
||||
Ring algorithm의 step별 chunk 인덱스 (reduce-scatter / all-gather):
|
||||
|
||||
```python
|
||||
for step in range(world_size - 1):
|
||||
send_idx, recv_idx = ring_step(rank, step, world_size)
|
||||
tl.send(dir="E", src_addr=chunks[send_idx].addr,
|
||||
nbytes=chunks[send_idx].nbytes,
|
||||
shape=(chunks[send_idx].n_elem,), dtype="f16")
|
||||
recv = tl.recv(dir="W", shape=(chunks[recv_idx].n_elem,), dtype="f16")
|
||||
# accumulate ...
|
||||
```
|
||||
|
||||
### `tree_step(rank, world_size) → {"parent": int|None, "children": list[int]}`
|
||||
|
||||
Binary tree의 parent/children rank:
|
||||
|
||||
```python
|
||||
info = tree_step(rank, world_size)
|
||||
if info["parent"] is None:
|
||||
print(f"rank {rank} is the root")
|
||||
for child in info["children"]:
|
||||
...
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6. 단위 테스트 — Mock Runtime
|
||||
|
||||
`kernbench.ccl.testing.run_kernel_in_mock`은 SimPy를 거치지 않고 알고리즘을
|
||||
빠르게 검증할 수 있다.
|
||||
|
||||
### 기본 사용법
|
||||
|
||||
```python
|
||||
from kernbench.ccl.testing import run_kernel_in_mock
|
||||
from kernbench.ccl.algorithms.my_algo import kernel
|
||||
import numpy as np
|
||||
|
||||
|
||||
def test_my_algo():
|
||||
n_elem = 16
|
||||
inputs = [np.arange(n_elem, dtype="f16") + r for r in range(4)]
|
||||
expected = sum(inputs)
|
||||
|
||||
outputs = run_kernel_in_mock(
|
||||
kernel_fn=kernel,
|
||||
world_size=4,
|
||||
topology="ring_1d",
|
||||
inputs=inputs,
|
||||
kernel_args=(n_elem, 4), # kernel의 (t_ptr 이후) 추가 positional 인자
|
||||
)
|
||||
|
||||
for r in range(4):
|
||||
assert np.allclose(outputs[r], expected, rtol=1e-3)
|
||||
```
|
||||
|
||||
### 동작
|
||||
|
||||
- 4개 rank의 kernel을 greenlet으로 동시 실행
|
||||
- `tl.send/recv`를 in-memory FIFO로 즉시 처리 (DMA, latency 무시)
|
||||
- 각 rank가 마지막에 store한 데이터를 ndarray로 반환
|
||||
|
||||
### 한계
|
||||
|
||||
- latency / 성능 측정 불가 (시뮬레이션이 아님)
|
||||
- PE_DMA, fabric, BW 모델 안 함
|
||||
- 정합성 검증만 가능
|
||||
- 한 cube 안에서 동작하는 가정 — `program_id(axis=1)`은 항상 0
|
||||
|
||||
---
|
||||
|
||||
## 7. 디버깅
|
||||
|
||||
### CCL trace
|
||||
|
||||
```bash
|
||||
KERNBENCH_CCL_TRACE=1 kernbench run --topology topology.yaml \
|
||||
--bench ccl_allreduce_tcm --verify-data
|
||||
```
|
||||
|
||||
각 rank의 send/recv 시점이 stdout에 출력된다:
|
||||
|
||||
```
|
||||
[ccl t=346.4 send] sip0.cube0.pe1 dir=E nbytes=64 seq=0
|
||||
[ccl t=360.4 recv] sip0.cube0.pe2 dir=W nbytes=64
|
||||
...
|
||||
```
|
||||
|
||||
### Pointer dump
|
||||
|
||||
`kernbench.ccl.diagnostics.pointer_dump(engine)`는 모든 PE_IPCQ의 ring buffer
|
||||
상태(`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`)를 multi-line
|
||||
문자열로 반환한다. hang이 발생하면 어느 rank가 어떤 상태에서 막혔는지 한눈에
|
||||
보인다.
|
||||
|
||||
### Deadlock detection
|
||||
|
||||
매칭되지 않는 send/recv 등으로 SimPy 스케줄이 비면 engine이 `IpcqDeadlock`을
|
||||
던지며 pointer dump를 메시지에 포함시킨다 (ADR-0023 D14 F3). 별도 wait-for graph
|
||||
시각화는 미래 작업.
|
||||
|
||||
---
|
||||
|
||||
## 8. 흔한 실수
|
||||
|
||||
### 1. install 안 된 direction 사용
|
||||
|
||||
ccl.yaml의 `topology: ring_1d`는 E/W만 install한다. N/S 사용 시:
|
||||
|
||||
```python
|
||||
tl.send(dir="N", ...) # → IpcqInvalidDirection 예외
|
||||
```
|
||||
|
||||
해결: `topology: mesh_2d`로 바꾸거나, `neighbors()` override로 N/S 추가.
|
||||
|
||||
### 2. send만 호출하고 recv 없음
|
||||
|
||||
```python
|
||||
def kernel(..., tl):
|
||||
for _ in range(100):
|
||||
tl.send(dir="E", ...)
|
||||
# peer 측 recv 없음 → ring buffer 가득 차면 backpressure → deadlock
|
||||
```
|
||||
|
||||
해결: 모든 send에 짝이 되는 recv가 있어야 한다. 안 그러면 `IpcqDeadlock`이
|
||||
발생한다.
|
||||
|
||||
### 3. dtype/shape 불일치
|
||||
|
||||
기본 모드에서는 dtype/shape mismatch를 검증하지 않는다. 작성자가 직접 보장하거나,
|
||||
PE_IPCQ 노드 attrs에 `strict_validation: true`를 설정해 D14 F2 strict 모드로
|
||||
mismatch를 즉시 잡을 수 있다.
|
||||
|
||||
### 4. round-robin recv의 fairness 가정
|
||||
|
||||
`tl.recv()` (방향 미지정)는 round-robin으로 가져오지만, 도착한 첫 슬롯을 반환한다.
|
||||
**도착 순서를 알 수 없으므로** 알고리즘이 도착 방향에 의존하면 안 된다.
|
||||
필요하면 `tl.recv(dir="N", ...)`처럼 명시.
|
||||
|
||||
### 5. CCL 그룹 크기 가정
|
||||
|
||||
`tl.num_programs(axis=0/1)`은 토폴로지 슬롯 개수이지 CCL group 크기가 아니다.
|
||||
참여하는 rank 수(`world_size`)는 호스트 bench가 알고 있고, kernel 인자로 명시
|
||||
전달해야 한다.
|
||||
|
||||
### 6. 호스트가 send-source 메모리를 도착 전에 덮어씀
|
||||
|
||||
PE_DMA가 송신 시점에 src 데이터를 토큰에 스냅샷해서 in-flight 데이터의 의미가
|
||||
보존된다. 그래도 하나의 PE 안에서 같은 주소를 여러 step에 걸쳐 갱신할 때는
|
||||
direct send 후 다른 step에서 같은 주소를 store해도 안전하다 (token snapshot 덕분).
|
||||
하지만 `tl.send`가 PE_DMA 큐에 enqueue되기 전에 주소를 덮어쓰면 잘못된 데이터가
|
||||
스냅샷된다 — `tl.send`를 먼저, 메모리 변경을 나중에 하는 게 권장.
|
||||
|
||||
---
|
||||
|
||||
## 9. 다음 단계
|
||||
|
||||
- `mesh_2d` / `tree_binary` 같은 다른 topology 활용
|
||||
- recursive halving/doubling 등 더 빠른 알고리즘
|
||||
- `buffer_kind` (tcm/hbm/sram) / `backpressure` (poll/sleep) 모드별 latency 비교
|
||||
- `ccl_ring_allreduce_multicube.py`, `ccl_ring_allreduce_multisip.py`처럼 큰
|
||||
scale의 ring 검증
|
||||
|
||||
새 알고리즘이나 패턴을 추가했다면 PR로 기여해주세요.
|
||||
|
||||
---
|
||||
|
||||
## 참고
|
||||
|
||||
- [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md): IPCQ + PE-level collective 설계
|
||||
- [ADR-0022](adr/ADR-0022-program-id-2d-grid.md): 2D grid program_id (axis=0/1)
|
||||
- [ADR-0020](adr/ADR-0020-data-execution-two-pass.md): 2-pass data execution
|
||||
- [ADR-0021](adr/ADR-0021-pe-pipeline-refactor.md): PE pipeline refactor
|
||||
|
||||
기존 알고리즘 예제:
|
||||
|
||||
- [`src/kernbench/ccl/algorithms/hello_send.py`](../src/kernbench/ccl/algorithms/hello_send.py) — 가장 단순한 send/recv
|
||||
- [`src/kernbench/ccl/algorithms/ring_allreduce.py`](../src/kernbench/ccl/algorithms/ring_allreduce.py) — ring all-reduce
|
||||
- [`src/kernbench/ccl/algorithms/mesh_allreduce.py`](../src/kernbench/ccl/algorithms/mesh_allreduce.py) — 2D mesh all-reduce
|
||||
- [`src/kernbench/ccl/algorithms/tree_allreduce.py`](../src/kernbench/ccl/algorithms/tree_allreduce.py) — binary tree all-reduce
|
||||
Reference in New Issue
Block a user