Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

Major changes:

PE-level IPCQ infrastructure:
- New PE_IPCQ component: ring-buffer control plane with 4-direction
  neighbor mapping, head/tail pointers, backpressure (poll/sleep).
- PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA,
  including in-flight data snapshot (D9) and op_log recording at
  outbound time for Phase 2 replay correctness.
- IpcqDmaToken piggyback model: data + metadata travel together,
  atomic visibility at receiver (invariant I6).
- Credit return fast path: bottleneck-BW latency, no fabric vc_comm.

Phase 2 data execution (ADR-0020 integration):
- op_log extended: DmaWriteCmd now captures src_space/src_addr for
  Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time.
- DataExecutor replays dma_write + ipcq_copy in t_start order.
- Engine._flush_data_phase: incremental cursor-based replay after
  each engine.wait() so host reads see post-Phase-2 data.
- KernelRunner Phase 1 writes disabled when op_log is active to
  prevent stale data from corrupting the MemoryStore snapshot.

TLContext / kernel API:
- tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype),
  tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode.
- TensorHandle operator overloading (add/sub/mul/div) via thread-local
  active TLContext → MathCmd dispatch through PE_MATH.
- PE-local scratch allocator for math output handles.
- tl.load returns space="hbm" handles for correct Phase 2 addressing.
- Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv.

Unified ccl_allreduce bench (PyTorch-compat host code):
- Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch)
  split matching real PyTorch DDP worker pattern.
- torch.distributed facade: init_process_group, get_world_size,
  get_rank, get_backend, all_reduce, barrier — only real PyTorch names.
- AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches
  kernel via tensor shard metadata (n_elem from shards[0].nbytes).
- world_size derived from topology spec (sips × cubes × pes_per_cube)
  with optional algorithm-level override in ccl.yaml.

Tensor API (PyTorch-compat surface):
- Tensor.numpy(): gather-aware (all shards via VA-based addressing).
- Tensor.copy_(source): scatter from host tensor into sharded target.
- RuntimeContext.from_numpy(arr): host-side staging tensor.
- Tensor.data property fixed to use numpy() (was shards[0]-only).

Algorithm modules moved to src/kernbench/ccl/algorithms/:
- ring_allreduce, mesh_allreduce, tree_allreduce, hello_send.
- Each module exports kernel_args(world_size, n_elem) helper.
- ccl.yaml module paths updated to kernbench.ccl.algorithms.*.

Dead code removed:
- 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.).
- _run_ccl_bench greenlet-per-SIP scheduler.
- benches.loader.is_ccl_bench + run_rank detection.
- benches/ccl/ directory.

Tests:
- New test_ccl_allreduce_matrix.py: 7 parametrized cases
  (ring×3 buffers, ring 8/16, mesh 4, tree 7).
- New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests.
- Existing tests updated for new import paths + world_size_override.

Docs:
- Korean ccl-author-guide.md and ADR-0023 paths updated.
- New English versions: ccl-author-guide.en.md, ADR-0023.en.md.

502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
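
Note on the Phase 2 replay bullets: the commit message says the DataExecutor replays ops in t_start order and that Engine._flush_data_phase advances a cursor after each engine.wait(). A minimal sketch of that idea follows. The `Op` and `Replayer` names, the dict-based memory, and the choice to sort only the entries appended since the last flush are all assumptions made for illustration, not the project's actual classes.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """Hypothetical op_log entry (not the project's real schema)."""
    t_start: float   # simulated start time recorded in Phase 1
    kind: str        # "dma_write" or "ipcq_copy"
    src: str         # source key in the toy memory
    dst: str         # destination key in the toy memory


class Replayer:
    """Toy cursor-based replayer over an append-only op_log."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.op_log: list = []
        self.cursor = 0  # number of log entries already replayed

    def flush(self) -> int:
        # Replay only the entries added since the previous flush,
        # ordered by t_start rather than by append order.
        pending = sorted(self.op_log[self.cursor:], key=lambda op: op.t_start)
        for op in pending:
            self.memory[op.dst] = self.memory[op.src]
        self.cursor = len(self.op_log)
        return len(pending)


mem = {"a": 1, "b": 0, "c": 0}
replayer = Replayer(mem)
# Logged out of time order: the copy b -> c starts after the write a -> b.
replayer.op_log.append(Op(2.0, "ipcq_copy", "b", "c"))
replayer.op_log.append(Op(1.0, "dma_write", "a", "b"))
first = replayer.flush()
second = replayer.flush()
```

In this toy case, replaying in append order would copy the stale value of `b` into `c`; ordering by t_start avoids that, which is presumably why the commit records ipcq_copy ops at outbound time.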
@@ -0,0 +1,234 @@
"""IPCQ schemas and exceptions (ADR-0023 D2.5, D12, D14 F1).

This module contains the data structures and exceptions used by the
PE-level IPCQ collective communication infrastructure. The host-facing
sideband fan-out message ``IpcqInitMsg`` lives in
``kernbench.runtime_api.kernel`` (alongside other fabric messages),
while all internal token / metadata / command schemas are kept here.

Layering:
    PE_CPU --IpcqRequest(IpcqSendCmd|IpcqRecvCmd)--> PE_IPCQ
    PE_IPCQ --IpcqDmaToken--> PE_DMA (vc_comm)
    PE_DMA --IpcqMetaArrival--> PE_IPCQ (atomic, D9)
    PE_IPCQ --IpcqCreditMetadata--> peer PE_IPCQ (fast path, D9)

See ADR-0023 for the full design.
"""
from __future__ import annotations

from dataclasses import dataclass, field
from typing import TYPE_CHECKING, Any, Union

if TYPE_CHECKING:
    import simpy


# ── D14 F1: invalid direction exception ──────────────────────────────


class IpcqInvalidDirection(ValueError):
    """Raised when a kernel calls tl.send/recv with a direction that
    has no neighbor installed for this PE."""


# ── D2.5: IpcqEndpoint ───────────────────────────────────────────────


@dataclass(frozen=True)
class IpcqEndpoint:
    """All the information the sender needs to compute the peer's
    rx_buffer address (D2.5).

    Sender PE_IPCQ uses this to compute the destination PA for its DMA
    write into the peer's rx ring buffer slot:

        slot_idx = sender.my_head % peer.n_slots
        dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
    """

    sip: int  # destination SIP
    cube: int  # destination cube
    pe: int  # destination PE (cube-local index)
    buffer_kind: str  # "tcm" | "hbm" | "sram"
    rx_base_pa: int  # peer rx_buffer base PA (PhysAddr.encode())
    rx_base_va: int  # peer rx_buffer base VA (optional, MMU)
    n_slots: int  # peer ring depth (wrap-around modulo)
    slot_size: int  # peer slot size (offset multiplier)
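
Review note: the address formula in the IpcqEndpoint docstring can be checked with a small standalone sketch. The `Endpoint` class and `slot_dst_pa` function below are hypothetical stand-ins that keep only the three fields the formula uses; the base address and sizes are made-up values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical, trimmed stand-in for IpcqEndpoint."""
    rx_base_pa: int  # peer rx_buffer base physical address
    n_slots: int     # peer ring depth
    slot_size: int   # bytes per slot


def slot_dst_pa(my_head: int, peer: Endpoint) -> int:
    """Destination PA for the sender's next write into the peer's rx ring."""
    slot_idx = my_head % peer.n_slots  # wrap around the ring
    return peer.rx_base_pa + slot_idx * peer.slot_size


peer = Endpoint(rx_base_pa=0x1000, n_slots=4, slot_size=256)
# A monotonically increasing head walks the four slots and then wraps.
addrs = [slot_dst_pa(head, peer) for head in range(6)]
```

With these values the fifth send (head 4) lands back on the first slot, so the sender must not issue it until the receiver has freed that slot.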


# ── D12: IpcqInitEntry (used by IpcqInitMsg in kernel.py) ────────────


@dataclass(frozen=True)
class IpcqInitEntry:
    """One direction's neighbor entry that backend installs into a PE_IPCQ
    via IpcqInitMsg (kernbench.runtime_api.kernel.IpcqInitMsg, D12).
    """

    direction: str  # "N" | "S" | "E" | "W"
    peer: IpcqEndpoint  # see D2.5
    my_rx_base_pa: int  # this PE's own rx_buffer base
    my_rx_base_va: int  # this PE's own rx_buffer base VA (optional)
    n_slots: int  # this PE's ring depth
    slot_size: int  # this PE's slot size
    # Credit fast path channel (D9).
    # Contract: must be a simpy.Store instance dedicated to receiving
    # IpcqCreditMetadata objects only. Backend wires it once at init time
    # and the receiving PE_IPCQ owns its consumer side; the sender (peer's
    # PE_IPCQ) puts IpcqCreditMetadata directly into this store via
    # _delayed_credit_send. Do not put any other object type.
    peer_credit_store: "simpy.Store"


# ── D12: IpcqSendCmd (PE_CPU → PE_IPCQ) ──────────────────────────────


@dataclass(frozen=True)
class IpcqSendCmd:
    """tl.send command issued by the kernel to PE_IPCQ."""

    direction: str  # "N" | "S" | "E" | "W"
    src_addr: int  # source data address (TCM/HBM/SRAM)
    src_space: str  # "tcm" | "hbm" | "sram"
    nbytes: int
    shape: tuple[int, ...]  # data shape (op_log + MemoryStore use)
    dtype: str
    handle_id: str  # completion tracking
    data_op: bool = True  # ADR-0020 op_log recording flag


# ── D12: IpcqRecvCmd (PE_CPU → PE_IPCQ) ──────────────────────────────


@dataclass(frozen=True)
class IpcqRecvCmd:
    """tl.recv command issued by the kernel to PE_IPCQ.

    Two modes (recv_mode):
        "return_slot" — return slot address as-is (default, zero-copy).
                        Kernel uses the slot memory directly.
        "copy_to_dst" — copy slot data to dst_addr, then return.
    """

    direction: str | None  # None → round-robin (weak fairness, D4)
    shape: tuple[int, ...]
    dtype: str
    handle_id: str
    recv_mode: str = "return_slot"
    dst_addr: int = 0  # used only when recv_mode == "copy_to_dst"
    dst_space: str = ""  # used only when recv_mode == "copy_to_dst"
    blocking: bool = True
    data_op: bool = True
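
Review note: `direction=None` is documented as round-robin with weak fairness (D4), but the selection rule itself is not part of this file. One plausible reading is sketched below; the fixed N/S/E/W scan order, the `pick_direction` helper, and the "start after the last served direction" rule are assumptions, not the behavior confirmed by the commit.

```python
DIRECTIONS = ("N", "S", "E", "W")


def pick_direction(ready, last_served):
    """Pick the next direction that has a pending rx slot.

    ready:       set of directions with data waiting (hypothetical input)
    last_served: direction served by the previous recv, or None
    Returns the chosen direction, or None when nothing is ready.
    """
    if last_served is None:
        start = 0
    else:
        # Begin scanning just after the direction served last time.
        start = (DIRECTIONS.index(last_served) + 1) % len(DIRECTIONS)
    for i in range(len(DIRECTIONS)):
        candidate = DIRECTIONS[(start + i) % len(DIRECTIONS)]
        if candidate in ready:
            return candidate
    return None


# Two neighbors keep sending; successive recvs alternate between them.
picks = []
last = None
for _ in range(4):
    last = pick_direction({"N", "E"}, last)
    picks.append(last)
```

Under this rule a direction that always has data cannot starve another ready direction, which is the property "weak fairness" usually refers to.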


# ── D12: IpcqDmaToken (PE_IPCQ → PE_DMA, vc_comm) ───────────────────


@dataclass
class IpcqDmaToken:
    """Token sent from PE_IPCQ to PE_DMA (vc_comm channel) carrying both
    the data move request and the piggyback metadata (ADR-0023 D9).

    Receiving PE_DMA processes this atomically (I6 MUST):
        1. MemoryStore.write(dst_endpoint.buffer_kind, dst_addr, data)
        2. Forward IpcqMetaArrival(token=self) to peer PE_IPCQ
    No yield is allowed between the two steps.

    The ``data`` field is a snapshot taken by the sender's PE_DMA at the
    moment the send is issued. This preserves "in-flight data" semantics:
    if the sender mutates its source memory after issuing the send but
    before arrival, the receiver still gets the snapshot. The snapshot is
    None for control-only tokens (e.g. credit-only updates).
    """

    # ── Data movement (single-hop DMA write) ──
    src_addr: int
    src_space: str
    dst_addr: int  # already-computed peer rx slot PA
    dst_endpoint: IpcqEndpoint  # routing target (sip/cube/pe) + buffer_kind
    nbytes: int
    handle_id: str  # completion notify back to sender PE_IPCQ
    # Optional shape/dtype carried for op_log + MemoryStore convenience.
    shape: tuple[int, ...] = ()
    dtype: str = "f16"
    # In-flight data snapshot (sender PE_DMA captures this at send time).
    data: Any = None

    # ── Piggyback metadata (D9) ──
    sender_seq: int = 0  # monotonic; receiver updates peer_head_cache
    src_sip: int = 0
    src_cube: int = 0
    src_pe: int = 0
    src_direction: str = "E"  # sender-side direction; receiver maps to its own

    data_op: bool = True
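
Review note: the two properties the IpcqDmaToken docstring relies on, the send-time snapshot and the write-then-notify delivery with nothing in between, can be illustrated without SimPy. The `Token`, `issue_send`, and `deliver` names and the bytearray memories below are hypothetical; skipping the write when `data` is None is an assumption about how control-only tokens are handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    """Hypothetical, trimmed stand-in for IpcqDmaToken."""
    dst_addr: int
    nbytes: int
    sender_seq: int
    data: Optional[bytes] = None  # snapshot; None for control-only tokens


def issue_send(src_mem: bytearray, src_addr: int, nbytes: int,
               dst_addr: int, seq: int) -> Token:
    # bytes(...) copies the source range, so later writes to src_mem
    # cannot change what the receiver eventually sees.
    snapshot = bytes(src_mem[src_addr:src_addr + nbytes])
    return Token(dst_addr, nbytes, seq, snapshot)


def deliver(token: Token, rx_mem: bytearray, inbox: list) -> None:
    # Step 1: write the payload into the receiver's slot.
    if token.data is not None:
        rx_mem[token.dst_addr:token.dst_addr + token.nbytes] = token.data
    # Step 2: post the metadata. There is no suspension point between the
    # two steps, so a consumer never sees metadata without the data.
    inbox.append(token)


src = bytearray(b"ABCDEFGH")
rx = bytearray(8)
inbox = []
tok = issue_send(src, src_addr=0, nbytes=4, dst_addr=4, seq=1)
src[0:4] = b"zzzz"  # sender overwrites its source while the token is in flight
deliver(tok, rx, inbox)
```

The receiver ends up with the original four bytes even though the sender overwrote them before delivery, which is the "in-flight data" behavior the docstring describes.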


# ── D12: IpcqMetaArrival (PE_DMA → PE_IPCQ, intra-PE wire) ──────────


@dataclass
class IpcqMetaArrival:
    """Posted by receiving PE_DMA into the destination PE's PE_IPCQ inbox
    in the same SimPy step as the MemoryStore.write (D9, I6 MUST).

    The receiver PE_IPCQ uses ``token.sender_seq`` to update its
    peer_head_cache for the corresponding direction.
    """

    token: IpcqDmaToken


# ── D12: IpcqCreditMetadata (PE_IPCQ → peer PE_IPCQ, fast path) ─────


@dataclass(frozen=True)
class IpcqCreditMetadata:
    """Credit return — recv-side → send-side fast path (D9).

    Sent by ``PeIpcqComponent._delayed_credit_send`` after a
    bottleneck-BW based latency, putting the metadata directly into
    the peer's pre-wired credit store (no fabric routing).
    """

    consumer_seq: int  # my_tail at recv side (new tail value)
    src_sip: int  # which peer is sending the credit
    src_cube: int
    src_pe: int
    src_direction: str  # sender-side direction (peer maps to its own)
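
Review note: the schema carries `consumer_seq` (the receiver's new tail) back to the sender, which suggests the usual ring-buffer backpressure rule. The sketch below shows one way the sender side could use it; the `SenderRing` class, the `peer_tail_cache` name, and the "in-flight count below n_slots" condition are assumptions, since the actual PE_IPCQ logic is not in this file.

```python
class SenderRing:
    """Hypothetical sender-side view of one direction's ring."""

    def __init__(self, n_slots: int):
        self.n_slots = n_slots
        self.my_head = 0          # sends issued so far
        self.peer_tail_cache = 0  # last tail value reported by the peer

    def can_send(self) -> bool:
        # Slots still occupied at the peer = issued minus consumed.
        return self.my_head - self.peer_tail_cache < self.n_slots

    def on_send(self) -> None:
        if not self.can_send():
            raise RuntimeError("ring full: wait for a credit")
        self.my_head += 1

    def on_credit(self, consumer_seq: int) -> None:
        # Credits carry an absolute tail; never move the cache backwards.
        self.peer_tail_cache = max(self.peer_tail_cache, consumer_seq)


ring = SenderRing(n_slots=2)
ring.on_send()
ring.on_send()
blocked = not ring.can_send()     # both slots are in flight
ring.on_credit(consumer_seq=1)    # peer consumed one slot
unblocked = ring.can_send()
```

Carrying an absolute sequence number rather than a delta means a delayed or repeated credit is harmless in this model, since the cache only ever moves forward.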


# ── Request wrapper (PE_CPU → PE_IPCQ) ───────────────────────────────


@dataclass
class IpcqRequest:
    """Wrapper carrying an IpcqSendCmd or IpcqRecvCmd plus a SimPy completion
    event. Posted by PE_CPU into PE_IPCQ's inbox; PE_IPCQ calls
    ``done.succeed()`` when the request is fully processed.

    For recv requests, the result (slot address, direction, dtype, shape)
    is written into ``result_data`` so the caller can read it after wait.
    """

    command: "IpcqSendCmd | IpcqRecvCmd"
    done: "simpy.Event"
    result_data: dict[str, Any] = field(default_factory=dict)


# ── RecvFuture (kernel ↔ runner handshake for tl.recv_async / tl.wait) ─


@dataclass
class RecvFuture:
    """Opaque future returned by ``tl.recv_async``.

    The KernelRunner attaches a SimPy event and the IpcqRequest in the
    background; ``tl.wait(future)`` switches back to the runner which
    yields on the event and resolves the result into a TensorHandle.
    """

    cmd: "IpcqRecvCmd"
    request: Any = None  # IpcqRequest (set by runner)
    event: Any = None  # simpy.Event (set by runner)
    resolved: bool = False
    result: Any = None  # cached TensorHandle after wait()

@@ -33,6 +33,7 @@ class TensorHandle:
    dtype: str
    nbytes: int  # total byte size
    data: object = None  # reserved for validate mode
    space: str = "tcm"  # MemoryStore space ("tcm" | "hbm" | "sram")


@dataclass(frozen=True)