Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

Major changes:

PE-level IPCQ infrastructure:
- New PE_IPCQ component: ring-buffer control plane with 4-direction
  neighbor mapping, head/tail pointers, backpressure (poll/sleep).
- PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA,
  including in-flight data snapshot (D9) and op_log recording at
  outbound time for Phase 2 replay correctness.
- IpcqDmaToken piggyback model: data + metadata travel together,
  atomic visibility at receiver (invariant I6).
- Credit return fast path: bottleneck-BW latency, no fabric vc_comm.
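
The ring-buffer control plane described above can be illustrated with a small sketch. The slot formula (slot_idx = head % n_slots, dst_pa = rx_base_pa + slot_idx * slot_size) is the one given in the IpcqEndpoint docstring in this commit. Everything else is an assumption made for illustration and is not the PE_IPCQ implementation: the RingSender class, its peer_tail_cache field, and the "head - tail < n_slots" fullness rule.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RingSender:
    """Hypothetical sender-side view of one direction's ring (illustration only)."""
    rx_base_pa: int           # peer rx_buffer base address
    n_slots: int              # peer ring depth
    slot_size: int            # bytes per slot
    head: int = 0             # monotonic count of slots produced
    peer_tail_cache: int = 0  # last tail value learned from a credit return

    def has_credit(self) -> bool:
        # Assumed rule: the ring is full once n_slots sends are unacknowledged.
        return self.head - self.peer_tail_cache < self.n_slots

    def next_slot_pa(self) -> int:
        # Slot formula from the IpcqEndpoint docstring.
        slot_idx = self.head % self.n_slots
        return self.rx_base_pa + slot_idx * self.slot_size

    def send(self) -> int | None:
        """Return the destination PA, or None when the sender must back off."""
        if not self.has_credit():
            return None
        dst_pa = self.next_slot_pa()
        self.head += 1
        return dst_pa

    def on_credit(self, consumer_seq: int) -> None:
        # A credit carries the receiver's new tail; never move it backwards.
        self.peer_tail_cache = max(self.peer_tail_cache, consumer_seq)
```

With a two-slot ring, a third send() returns None until on_credit() reports that the receiver has consumed a slot. In the simulator, that None case would correspond to the poll/sleep backpressure path.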

Phase 2 data execution (ADR-0020 integration):
- op_log extended: DmaWriteCmd now captures src_space/src_addr for
  Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time.
- DataExecutor replays dma_write + ipcq_copy in t_start order.
- Engine._flush_data_phase: incremental cursor-based replay after
  each engine.wait() so host reads see post-Phase-2 data.
- KernelRunner Phase 1 writes disabled when op_log is active to
  prevent stale data from corrupting the MemoryStore snapshot.
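
A minimal sketch of cursor-based replay in t_start order, as described in the list above. The CopyOp record, the Replayer class, and the dict-of-bytearray memory model are hypothetical stand-ins chosen for illustration. They are not the actual op_log, DataExecutor, or MemoryStore types.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CopyOp:
    """Hypothetical op_log record: copy nbytes from src to dst."""
    t_start: float
    src: tuple[str, int]  # (space, addr)
    dst: tuple[str, int]  # (space, addr)
    nbytes: int


class Replayer:
    """Replays only the ops appended since the previous flush."""

    def __init__(self, memory: dict[str, bytearray]) -> None:
        self.memory = memory
        self.op_log: list[CopyOp] = []
        self.cursor = 0  # index of the first op not yet replayed

    def flush(self) -> int:
        """Replay pending ops ordered by t_start; return how many ran."""
        pending = sorted(self.op_log[self.cursor:], key=lambda op: op.t_start)
        for op in pending:
            (s_space, s_addr), (d_space, d_addr) = op.src, op.dst
            # Copy out first so overlapping ranges behave predictably.
            data = bytes(self.memory[s_space][s_addr:s_addr + op.nbytes])
            self.memory[d_space][d_addr:d_addr + op.nbytes] = data
        self.cursor = len(self.op_log)
        return len(pending)
```

Calling flush() after each wait point means a later host read observes the replayed data, and a second flush() with no new ops does nothing.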

TLContext / kernel API:
- tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype),
  tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode.
- TensorHandle operator overloading (add/sub/mul/div) via thread-local
  active TLContext → MathCmd dispatch through PE_MATH.
- PE-local scratch allocator for math output handles.
- tl.load returns space="hbm" handles for correct Phase 2 addressing.
- Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv.
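
The operator-overloading item above (arithmetic on handles routed through a thread-local active context) can be sketched as follows. Handle and Context are hypothetical stand-ins for TensorHandle and TLContext, and the recorded tuples stand in for MathCmd. The real classes and their dispatch through PE_MATH are not reproduced here.

```python
from __future__ import annotations

import threading
from dataclasses import dataclass

_active = threading.local()  # holds the context active on the current thread


@dataclass(frozen=True)
class Handle:
    """Stand-in for a tensor handle (illustration only)."""
    name: str

    def _binary(self, op: str, other: Handle) -> Handle:
        ctx = getattr(_active, "ctx", None)
        if ctx is None:
            raise RuntimeError("no active context on this thread")
        return ctx.emit(op, self, other)

    def __add__(self, other: Handle) -> Handle:
        return self._binary("add", other)

    def __sub__(self, other: Handle) -> Handle:
        return self._binary("sub", other)

    def __mul__(self, other: Handle) -> Handle:
        return self._binary("mul", other)

    def __truediv__(self, other: Handle) -> Handle:
        return self._binary("div", other)


class Context:
    """Records one command per operator call and allocates an output handle."""

    def __init__(self) -> None:
        self.commands: list[tuple[str, str, str, str]] = []
        self._prev: Context | None = None

    def emit(self, op: str, a: Handle, b: Handle) -> Handle:
        out = Handle(f"t{len(self.commands)}")  # scratch output name
        self.commands.append((op, a.name, b.name, out.name))
        return out

    def __enter__(self) -> Context:
        self._prev = getattr(_active, "ctx", None)
        _active.ctx = self
        return self

    def __exit__(self, *exc: object) -> None:
        _active.ctx = self._prev
```

Inside `with Context() as ctx:`, the expression `(a + b) * c` records an add followed by a mul. Outside any context the same expression raises RuntimeError.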

Unified ccl_allreduce bench (PyTorch-compat host code):
- Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch)
  split matching real PyTorch DDP worker pattern.
- torch.distributed facade: init_process_group, get_world_size,
  get_rank, get_backend, all_reduce, barrier — only real PyTorch names.
- AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches
  kernel via tensor shard metadata (n_elem from shards[0].nbytes).
- world_size derived from topology spec (sips × cubes × pes_per_cube)
  with optional algorithm-level override in ccl.yaml.
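
As a rough sketch of the world_size derivation in the last item above: the product sips × cubes × pes_per_cube comes from this commit message. The function name, the parameter names, and the rule that an override must lie between 1 and the topology total are assumptions for illustration.

```python
from __future__ import annotations


def derive_world_size(
    sips: int,
    cubes: int,
    pes_per_cube: int,
    override: int | None = None,
) -> int:
    """Return the number of ranks implied by the topology, or the override."""
    total = sips * cubes * pes_per_cube
    if override is None:
        return total
    # Assumed validation: an override may shrink the group but not exceed it.
    if not 1 <= override <= total:
        raise ValueError(f"override {override} outside 1..{total}")
    return override
```

For example, a topology of 2 SIPs with 2 cubes of 4 PEs each yields 16 ranks, and an override of 8 would select an 8-rank group.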

Tensor API (PyTorch-compat surface):
- Tensor.numpy(): gather-aware (all shards via VA-based addressing).
- Tensor.copy_(source): scatter from host tensor into sharded target.
- RuntimeContext.from_numpy(arr): host-side staging tensor.
- Tensor.data property fixed to use numpy() (was shards[0]-only).

Algorithm modules moved to src/kernbench/ccl/algorithms/:
- ring_allreduce, mesh_allreduce, tree_allreduce, hello_send.
- Each module exports kernel_args(world_size, n_elem) helper.
- ccl.yaml module paths updated to kernbench.ccl.algorithms.*.

Dead code removed:
- 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.).
- _run_ccl_bench greenlet-per-SIP scheduler.
- benches.loader.is_ccl_bench + run_rank detection.
- benches/ccl/ directory.

Tests:
- New test_ccl_allreduce_matrix.py: 7 parametrized cases
  (ring×3 buffers, ring 8/16, mesh 4, tree 7).
- New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests.
- Existing tests updated for new import paths + world_size_override.

Docs:
- Korean ccl-author-guide.md and ADR-0023 paths updated.
- New English versions: ccl-author-guide.en.md, ADR-0023.en.md.

502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 19:36:59 -07:00
parent ff2c677a9c
commit 998cc85762
60 changed files with 9196 additions and 80 deletions
@@ -0,0 +1,234 @@
"""IPCQ schemas and exceptions (ADR-0023 D2.5, D12, D14 F1).

This module contains the data structures and exceptions used by the
PE-level IPCQ collective communication infrastructure. The host-facing
sideband fan-out message ``IpcqInitMsg`` lives in
``kernbench.runtime_api.kernel`` (alongside other fabric messages),
while all internal token / metadata / command schemas are kept here.

Layering:
    PE_CPU  --IpcqRequest(IpcqSendCmd|IpcqRecvCmd)--> PE_IPCQ
    PE_IPCQ --IpcqDmaToken--> PE_DMA (vc_comm)
    PE_DMA  --IpcqMetaArrival--> PE_IPCQ (atomic, D9)
    PE_IPCQ --IpcqCreditMetadata--> peer PE_IPCQ (fast path, D9)

See ADR-0023 for the full design.
"""
from __future__ import annotations

from dataclasses import dataclass, field
from typing import TYPE_CHECKING, Any, Union

if TYPE_CHECKING:
    import simpy


# ── D14 F1: invalid direction exception ──────────────────────────────
class IpcqInvalidDirection(ValueError):
    """Raised when a kernel calls tl.send/recv with a direction that
    has no neighbor installed for this PE."""


# ── D2.5: IpcqEndpoint ───────────────────────────────────────────────
@dataclass(frozen=True)
class IpcqEndpoint:
    """Everything the sender needs to compute the peer's rx_buffer address (D2.5).

    Sender PE_IPCQ uses this to compute the destination PA for its DMA
    write into the peer's rx ring buffer slot:

        slot_idx = sender.my_head % peer.n_slots
        dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
    """
    sip: int  # destination SIP
    cube: int  # destination cube
    pe: int  # destination PE (cube-local index)
    buffer_kind: str  # "tcm" | "hbm" | "sram"
    rx_base_pa: int  # peer rx_buffer base PA (PhysAddr.encode())
    rx_base_va: int  # peer rx_buffer base VA (optional, MMU)
    n_slots: int  # peer ring depth (wrap-around modulo)
    slot_size: int  # peer slot size (offset multiplier)


# ── D12: IpcqInitEntry (used by IpcqInitMsg in kernel.py) ────────────
@dataclass(frozen=True)
class IpcqInitEntry:
    """One direction's neighbor entry that backend installs into a PE_IPCQ
    via IpcqInitMsg (kernbench.runtime_api.kernel.IpcqInitMsg, D12).
    """
    direction: str  # "N" | "S" | "E" | "W"
    peer: IpcqEndpoint  # see D2.5
    my_rx_base_pa: int  # this PE's own rx_buffer base
    my_rx_base_va: int  # this PE's own rx_buffer base VA (optional)
    n_slots: int  # this PE's ring depth
    slot_size: int  # this PE's slot size
    # Credit fast path channel (D9).
    # Contract: must be a simpy.Store instance dedicated to receiving
    # IpcqCreditMetadata objects only. Backend wires it once at init time
    # and the receiving PE_IPCQ owns its consumer side; the sender (peer's
    # PE_IPCQ) puts IpcqCreditMetadata directly into this store via
    # _delayed_credit_send. Do not put any other object type.
    peer_credit_store: "simpy.Store"


# ── D12: IpcqSendCmd (PE_CPU → PE_IPCQ) ──────────────────────────────
@dataclass(frozen=True)
class IpcqSendCmd:
    """tl.send command issued by the kernel to PE_IPCQ."""
    direction: str  # "N" | "S" | "E" | "W"
    src_addr: int  # source data address (TCM/HBM/SRAM)
    src_space: str  # "tcm" | "hbm" | "sram"
    nbytes: int
    shape: tuple[int, ...]  # data shape (op_log + MemoryStore use)
    dtype: str
    handle_id: str  # completion tracking
    data_op: bool = True  # ADR-0020 op_log recording flag


# ── D12: IpcqRecvCmd (PE_CPU → PE_IPCQ) ──────────────────────────────
@dataclass(frozen=True)
class IpcqRecvCmd:
    """tl.recv command issued by the kernel to PE_IPCQ.

    Two modes (recv_mode):
        "return_slot": return slot address as-is (default, zero-copy).
                       Kernel uses the slot memory directly.
        "copy_to_dst": copy slot data to dst_addr, then return.
    """
    direction: str | None  # None → round-robin (weak fairness, D4)
    shape: tuple[int, ...]
    dtype: str
    handle_id: str
    recv_mode: str = "return_slot"
    dst_addr: int = 0  # used only when recv_mode == "copy_to_dst"
    dst_space: str = ""  # used only when recv_mode == "copy_to_dst"
    blocking: bool = True
    data_op: bool = True


# ── D12: IpcqDmaToken (PE_IPCQ → PE_DMA, vc_comm) ───────────────────
@dataclass
class IpcqDmaToken:
    """Token sent from PE_IPCQ to PE_DMA (vc_comm channel) carrying both
    the data move request and the piggyback metadata (ADR-0023 D9).

    Receiving PE_DMA processes this atomically (I6 MUST):
        1. MemoryStore.write(dst_endpoint.buffer_kind, dst_addr, data)
        2. Forward IpcqMetaArrival(token=self) to peer PE_IPCQ
    No yield is allowed between the two steps.

    The ``data`` field is a snapshot taken by the sender's PE_DMA at the
    moment the send is issued. This preserves "in-flight data" semantics:
    if the sender mutates its source memory after issuing the send but
    before arrival, the receiver still gets the snapshot. The snapshot is
    None for control-only tokens (e.g. credit-only updates).
    """
    # ── Data movement (single-hop DMA write) ──
    src_addr: int
    src_space: str
    dst_addr: int  # already-computed peer rx slot PA
    dst_endpoint: IpcqEndpoint  # routing target (sip/cube/pe) + buffer_kind
    nbytes: int
    handle_id: str  # completion notify back to sender PE_IPCQ
    # Optional shape/dtype carried for op_log + MemoryStore convenience.
    shape: tuple[int, ...] = ()
    dtype: str = "f16"
    # In-flight data snapshot (sender PE_DMA captures this at send time).
    data: Any = None
    # ── Piggyback metadata (D9) ──
    sender_seq: int = 0  # monotonic; receiver updates peer_head_cache
    src_sip: int = 0
    src_cube: int = 0
    src_pe: int = 0
    src_direction: str = "E"  # sender-side direction; receiver maps to its own
    data_op: bool = True


# ── D12: IpcqMetaArrival (PE_DMA → PE_IPCQ, intra-PE wire) ──────────
@dataclass
class IpcqMetaArrival:
    """Posted by receiving PE_DMA into the destination PE's PE_IPCQ inbox
    in the same SimPy step as the MemoryStore.write (D9, I6 MUST).

    The receiver PE_IPCQ uses ``token.sender_seq`` to update its
    peer_head_cache for the corresponding direction.
    """
    token: IpcqDmaToken


# ── D12: IpcqCreditMetadata (PE_IPCQ → peer PE_IPCQ, fast path) ─────
@dataclass(frozen=True)
class IpcqCreditMetadata:
    """Credit return — recv-side → send-side fast path (D9).

    Sent by ``PeIpcqComponent._delayed_credit_send`` after a
    bottleneck-BW based latency, putting the metadata directly into
    the peer's pre-wired credit store (no fabric routing).
    """
    consumer_seq: int  # my_tail at recv side (new tail value)
    src_sip: int  # which peer is sending the credit
    src_cube: int
    src_pe: int
    src_direction: str  # sender-side direction (peer maps to its own)


# ── Request wrapper (PE_CPU → PE_IPCQ) ───────────────────────────────
@dataclass
class IpcqRequest:
    """Wrapper carrying an IpcqSendCmd or IpcqRecvCmd plus a SimPy completion
    event. Posted by PE_CPU into PE_IPCQ's inbox; PE_IPCQ calls
    ``done.succeed()`` when the request is fully processed.

    For recv requests, the result (slot address, direction, dtype, shape)
    is written into ``result_data`` so the caller can read it after wait.
    """
    command: "IpcqSendCmd | IpcqRecvCmd"
    done: "simpy.Event"
    result_data: dict[str, Any] = field(default_factory=dict)


# ── RecvFuture (kernel ↔ runner handshake for tl.recv_async / tl.wait) ─
@dataclass
class RecvFuture:
    """Opaque future returned by ``tl.recv_async``.

    The KernelRunner attaches a SimPy event and the IpcqRequest in the
    background; ``tl.wait(future)`` switches back to the runner which
    yields on the event and resolves the result into a TensorHandle.
    """
    cmd: "IpcqRecvCmd"
    request: Any = None  # IpcqRequest (set by runner)
    event: Any = None  # simpy.Event (set by runner)
    resolved: bool = False
    result: Any = None  # cached TensorHandle after wait()
@@ -33,6 +33,7 @@ class TensorHandle:
    dtype: str
    nbytes: int  # total byte size
    data: object = None  # reserved for validate mode
    space: str = "tcm"  # MemoryStore space ("tcm" | "hbm" | "sram")
@dataclass(frozen=True)