Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

Major changes:

PE-level IPCQ infrastructure:
- New PE_IPCQ component: ring-buffer control plane with 4-direction
  neighbor mapping, head/tail pointers, backpressure (poll/sleep).
- PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA,
  including in-flight data snapshot (D9) and op_log recording at
  outbound time for Phase 2 replay correctness.
- IpcqDmaToken piggyback model: data + metadata travel together,
  atomic visibility at receiver (invariant I6).
- Credit return fast path: bottleneck-BW latency, no fabric vc_comm.

Phase 2 data execution (ADR-0020 integration):
- op_log extended: DmaWriteCmd now captures src_space/src_addr for
  Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time.
- DataExecutor replays dma_write + ipcq_copy in t_start order.
- Engine._flush_data_phase: incremental cursor-based replay after each
  engine.wait() so host reads see post-Phase-2 data.
- KernelRunner Phase 1 writes disabled when op_log is active to prevent
  stale data from corrupting the MemoryStore snapshot.

TLContext / kernel API:
- tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype),
  tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode.
- TensorHandle operator overloading (add/sub/mul/div) via thread-local
  active TLContext → MathCmd dispatch through PE_MATH.
- PE-local scratch allocator for math output handles.
- tl.load returns space="hbm" handles for correct Phase 2 addressing.
- Additional math functions: maximum, minimum, fma, clamp, softmax,
  cdiv.

Unified ccl_allreduce bench (PyTorch-compat host code):
- Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch)
  split matching real PyTorch DDP worker pattern.
- torch.distributed facade: init_process_group, get_world_size,
  get_rank, get_backend, all_reduce, barrier (only real PyTorch names).
- AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches
  kernel via tensor shard metadata (n_elem from shards[0].nbytes).
- world_size derived from topology spec (sips × cubes × pes_per_cube)
  with optional algorithm-level override in ccl.yaml.

Tensor API (PyTorch-compat surface):
- Tensor.numpy(): gather-aware (all shards via VA-based addressing).
- Tensor.copy_(source): scatter from host tensor into sharded target.
- RuntimeContext.from_numpy(arr): host-side staging tensor.
- Tensor.data property fixed to use numpy() (was shards[0]-only).

Algorithm modules moved to src/kernbench/ccl/algorithms/:
- ring_allreduce, mesh_allreduce, tree_allreduce, hello_send.
- Each module exports kernel_args(world_size, n_elem) helper.
- ccl.yaml module paths updated to kernbench.ccl.algorithms.*.

Dead code removed:
- 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.).
- _run_ccl_bench greenlet-per-SIP scheduler.
- benches.loader.is_ccl_bench + run_rank detection.
- benches/ccl/ directory.

Tests:
- New test_ccl_allreduce_matrix.py: 7 parametrized cases (ring×3
  buffers, ring 8/16, mesh 4, tree 7).
- New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests.
- Existing tests updated for new import paths + world_size_override.

Docs:
- Korean ccl-author-guide.md and ADR-0023 paths updated.
- New English versions: ccl-author-guide.en.md, ADR-0023.en.md.

502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
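
The Phase 2 items in the message (replay in t_start order, with an incremental cursor advanced on each flush) can be illustrated with a minimal sketch. `Op`, `Replayer`, and `flush` are hypothetical names chosen for this example and are not the DataExecutor or Engine API; the real op_log entries carry more fields than shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Op:
    """Hypothetical op_log record: copy nbytes from src to dst at t_start."""
    t_start: float
    src: int
    dst: int
    nbytes: int


class Replayer:
    """Assumed shape of a cursor-based replayer; not the real DataExecutor."""

    def __init__(self, mem: bytearray) -> None:
        self.mem = mem
        self.cursor = 0  # index of the first op_log entry not yet replayed

    def flush(self, op_log: List[Op]) -> int:
        """Replay entries appended since the last flush, ordered by t_start."""
        pending = sorted(op_log[self.cursor:], key=lambda op: op.t_start)
        for op in pending:
            # The right-hand slice is a copy, so overlapping ranges are safe.
            self.mem[op.dst:op.dst + op.nbytes] = self.mem[op.src:op.src + op.nbytes]
        self.cursor = len(op_log)
        return len(pending)


mem = bytearray(b"AB......")
log = [
    Op(t_start=2.0, src=2, dst=4, nbytes=2),  # recorded first, runs second
    Op(t_start=1.0, src=0, dst=2, nbytes=2),  # recorded second, runs first
]
replayer = Replayer(mem)
first = replayer.flush(log)    # replays both entries in t_start order

log.append(Op(t_start=3.0, src=4, dst=6, nbytes=2))
second = replayer.flush(log)   # replays only the newly appended entry
```

Sorting only the unreplayed tail keeps each flush proportional to the new work, which is the property the cursor is meant to provide in this sketch.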
2026-04-12 19:36:59 -07:00
parent ff2c677a9c
commit 998cc85762
60 changed files with 9196 additions and 80 deletions
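
The in-flight data snapshot (D9) named in the commit message and in the IpcqDmaToken docstring can be illustrated as follows. This is a sketch under assumptions: `Token`, `issue_send`, and `deliver` are hypothetical stand-ins, and plain bytearrays replace the MemoryStore.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    """Hypothetical reduced token: destination address plus data snapshot."""
    dst_addr: int
    nbytes: int
    data: Optional[bytes] = None


def issue_send(src_mem: bytearray, src_addr: int, dst_addr: int, nbytes: int) -> Token:
    # bytes(...) copies the range, so later writes to src_mem cannot alter the token.
    snapshot = bytes(src_mem[src_addr:src_addr + nbytes])
    return Token(dst_addr=dst_addr, nbytes=nbytes, data=snapshot)


def deliver(dst_mem: bytearray, token: Token) -> None:
    # Control-only tokens carry no payload and leave memory untouched.
    if token.data is not None:
        dst_mem[token.dst_addr:token.dst_addr + token.nbytes] = token.data


sender_mem = bytearray(b"ABCDEFGH")
receiver_mem = bytearray(8)

token = issue_send(sender_mem, src_addr=0, dst_addr=4, nbytes=4)
sender_mem[0:4] = b"zzzz"   # sender reuses its buffer while the token is in flight
deliver(receiver_mem, token)
```

The receiver ends up with the bytes that were present when the send was issued, not the later overwrite.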
@@ -0,0 +1,234 @@
+"""IPCQ schemas and exceptions (ADR-0023 D2.5, D12, D14 F1).
+
+This module contains the data structures and exceptions used by the
+PE-level IPCQ collective communication infrastructure. The host-facing
+sideband fan-out message ``IpcqInitMsg`` lives in
+``kernbench.runtime_api.kernel`` (alongside other fabric messages),
+while all internal token / metadata / command schemas are kept here.
+
+Layering:
+    PE_CPU       --IpcqRequest(IpcqSendCmd|IpcqRecvCmd)--> PE_IPCQ
+    PE_IPCQ      --IpcqDmaToken-->                         PE_DMA (vc_comm)
+    PE_DMA       --IpcqMetaArrival-->                      PE_IPCQ (atomic, D9)
+    PE_IPCQ      --IpcqCreditMetadata-->                   peer PE_IPCQ (fast path, D9)
+
+See ADR-0023 for the full design.
+"""
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from typing import TYPE_CHECKING, Any
+
+if TYPE_CHECKING:
+    import simpy
+
+
+# ── D14 F1: invalid direction exception ──────────────────────────────
+
+
+class IpcqInvalidDirection(ValueError):
+    """Raised when a kernel calls tl.send/recv with a direction that
+    has no neighbor installed for this PE."""
+
+
+# ── D2.5: IpcqEndpoint ───────────────────────────────────────────────
+
+
+@dataclass(frozen=True)
+class IpcqEndpoint:
+    """Everything the sender needs to compute the peer's rx_buffer address (D2.5).
+
+    Sender PE_IPCQ uses this to compute the destination PA for its DMA
+    write into the peer's rx ring buffer slot:
+
+        slot_idx = sender.my_head % peer.n_slots
+        dst_pa   = peer.rx_base_pa + slot_idx * peer.slot_size
+    """
+
+    sip: int                     # destination SIP
+    cube: int                    # destination cube
+    pe: int                      # destination PE (cube-local index)
+    buffer_kind: str             # "tcm" | "hbm" | "sram"
+    rx_base_pa: int              # peer rx_buffer base PA (PhysAddr.encode())
+    rx_base_va: int              # peer rx_buffer base VA (optional, MMU)
+    n_slots: int                 # peer ring depth (wrap-around modulo)
+    slot_size: int               # peer slot size (offset multiplier)
+
+
+# ── D12: IpcqInitEntry (used by IpcqInitMsg in kernel.py) ────────────
+
+
+@dataclass(frozen=True)
+class IpcqInitEntry:
+    """One direction's neighbor entry that backend installs into a PE_IPCQ
+    via IpcqInitMsg (kernbench.runtime_api.kernel.IpcqInitMsg, D12).
+    """
+
+    direction: str               # "N" | "S" | "E" | "W"
+    peer: IpcqEndpoint           # see D2.5
+    my_rx_base_pa: int           # this PE's own rx_buffer base
+    my_rx_base_va: int           # this PE's own rx_buffer base VA (optional)
+    n_slots: int                 # this PE's ring depth
+    slot_size: int               # this PE's slot size
+    # Credit fast path channel (D9).
+    # Contract: must be a simpy.Store instance dedicated to receiving
+    # IpcqCreditMetadata objects only. Backend wires it once at init time
+    # and the receiving PE_IPCQ owns its consumer side; the sender (peer's
+    # PE_IPCQ) puts IpcqCreditMetadata directly into this store via
+    # _delayed_credit_send. Do not put any other object type.
+    peer_credit_store: "simpy.Store"
+
+
+# ── D12: IpcqSendCmd (PE_CPU → PE_IPCQ) ──────────────────────────────
+
+
+@dataclass(frozen=True)
+class IpcqSendCmd:
+    """tl.send command issued by the kernel to PE_IPCQ."""
+
+    direction: str               # "N" | "S" | "E" | "W"
+    src_addr: int                # source data address (TCM/HBM/SRAM)
+    src_space: str               # "tcm" | "hbm" | "sram"
+    nbytes: int
+    shape: tuple[int, ...]       # data shape (op_log + MemoryStore use)
+    dtype: str
+    handle_id: str               # completion tracking
+    data_op: bool = True         # ADR-0020 op_log recording flag
+
+
+# ── D12: IpcqRecvCmd (PE_CPU → PE_IPCQ) ──────────────────────────────
+
+
+@dataclass(frozen=True)
+class IpcqRecvCmd:
+    """tl.recv command issued by the kernel to PE_IPCQ.
+
+    Two modes (recv_mode):
+        "return_slot" — return slot address as-is (default, zero-copy).
+                        Kernel uses the slot memory directly.
+        "copy_to_dst" — copy slot data to dst_addr, then return.
+    """
+
+    direction: str | None        # None → round-robin (weak fairness, D4)
+    shape: tuple[int, ...]
+    dtype: str
+    handle_id: str
+    recv_mode: str = "return_slot"
+    dst_addr: int = 0            # used only when recv_mode == "copy_to_dst"
+    dst_space: str = ""          # used only when recv_mode == "copy_to_dst"
+    blocking: bool = True
+    data_op: bool = True
+
+
+# ── D12: IpcqDmaToken (PE_IPCQ → PE_DMA, vc_comm) ───────────────────
+
+
+@dataclass
+class IpcqDmaToken:
+    """Token sent from PE_IPCQ to PE_DMA (vc_comm channel) carrying both
+    the data move request and the piggyback metadata (ADR-0023 D9).
+
+    Receiving PE_DMA processes this atomically (I6 MUST):
+        1. MemoryStore.write(dst_endpoint.buffer_kind, dst_addr, data)
+        2. Forward IpcqMetaArrival(token=self) to peer PE_IPCQ
+    No yield is allowed between the two steps.
+
+    The ``data`` field is a snapshot taken by the sender's PE_DMA at the
+    moment the send is issued. This preserves "in-flight data" semantics:
+    if the sender mutates its source memory after issuing the send but
+    before arrival, the receiver still gets the snapshot. The snapshot is
+    None for control-only tokens (e.g. credit-only updates).
+    """
+
+    # ── Data movement (single-hop DMA write) ──
+    src_addr: int
+    src_space: str
+    dst_addr: int                # already-computed peer rx slot PA
+    dst_endpoint: IpcqEndpoint   # routing target (sip/cube/pe) + buffer_kind
+    nbytes: int
+    handle_id: str               # completion notify back to sender PE_IPCQ
+    # Optional shape/dtype carried for op_log + MemoryStore convenience.
+    shape: tuple[int, ...] = ()
+    dtype: str = "f16"
+    # In-flight data snapshot (sender PE_DMA captures this at send time).
+    data: Any = None
+
+    # ── Piggyback metadata (D9) ──
+    sender_seq: int = 0          # monotonic; receiver updates peer_head_cache
+    src_sip: int = 0
+    src_cube: int = 0
+    src_pe: int = 0
+    src_direction: str = "E"     # sender-side direction; receiver maps to its own
+
+    data_op: bool = True
+
+
+# ── D12: IpcqMetaArrival (PE_DMA → PE_IPCQ, intra-PE wire) ──────────
+
+
+@dataclass
+class IpcqMetaArrival:
+    """Posted by receiving PE_DMA into the destination PE's PE_IPCQ inbox
+    in the same SimPy step as the MemoryStore.write (D9, I6 MUST).
+
+    The receiver PE_IPCQ uses ``token.sender_seq`` to update its
+    peer_head_cache for the corresponding direction.
+    """
+
+    token: IpcqDmaToken
+
+
+# ── D12: IpcqCreditMetadata (PE_IPCQ → peer PE_IPCQ, fast path) ─────
+
+
+@dataclass(frozen=True)
+class IpcqCreditMetadata:
+    """Credit return — recv-side → send-side fast path (D9).
+
+    Sent by ``PeIpcqComponent._delayed_credit_send`` after a
+    bottleneck-BW based latency, putting the metadata directly into
+    the peer's pre-wired credit store (no fabric routing).
+    """
+
+    consumer_seq: int            # my_tail at recv side (new tail value)
+    src_sip: int                 # which peer is sending the credit
+    src_cube: int
+    src_pe: int
+    src_direction: str           # sender-side direction (peer maps to its own)
+
+
+# ── Request wrapper (PE_CPU → PE_IPCQ) ───────────────────────────────
+
+
+@dataclass
+class IpcqRequest:
+    """Wrapper carrying an IpcqSendCmd or IpcqRecvCmd plus a SimPy completion
+    event. Posted by PE_CPU into PE_IPCQ's inbox; PE_IPCQ calls
+    ``done.succeed()`` when the request is fully processed.
+
+    For recv requests, the result (slot address, direction, dtype, shape)
+    is written into ``result_data`` so the caller can read it after wait.
+    """
+
+    command: "IpcqSendCmd | IpcqRecvCmd"
+    done: "simpy.Event"
+    result_data: dict[str, Any] = field(default_factory=dict)
+
+
+# ── RecvFuture (kernel ↔ runner handshake for tl.recv_async / tl.wait) ─
+
+
+@dataclass
+class RecvFuture:
+    """Opaque future returned by ``tl.recv_async``.
+
+    The KernelRunner attaches a SimPy event and the IpcqRequest in the
+    background; ``tl.wait(future)`` switches back to the runner which
+    yields on the event and resolves the result into a TensorHandle.
+    """
+
+    cmd: "IpcqRecvCmd"
+    request: Any = None         # IpcqRequest (set by runner)
+    event: Any = None           # simpy.Event (set by runner)
+    resolved: bool = False
+    result: Any = None          # cached TensorHandle after wait()
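
The slot addressing from the IpcqEndpoint docstring can be exercised in isolation. In this sketch `PeerRing`, `slot_pa`, and `ring_full` are hypothetical names. The address formula is taken from the docstring, while the full-ring check is an assumption about how head/tail backpressure might be expressed and is not confirmed by the diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerRing:
    """Hypothetical stand-in for the three IpcqEndpoint fields used here."""
    rx_base_pa: int
    n_slots: int
    slot_size: int


def slot_pa(my_head: int, peer: PeerRing) -> int:
    """Destination PA for the next send, per the IpcqEndpoint docstring."""
    slot_idx = my_head % peer.n_slots
    return peer.rx_base_pa + slot_idx * peer.slot_size


def ring_full(my_head: int, peer_tail_cache: int, peer: PeerRing) -> bool:
    """Assumed backpressure check: every slot holds an unconsumed message."""
    return my_head - peer_tail_cache >= peer.n_slots


peer = PeerRing(rx_base_pa=0x1000, n_slots=4, slot_size=256)

# Heads 0..5 wrap around the four-slot ring after the fourth send.
addresses = [slot_pa(head, peer) for head in range(6)]
```

With these values, heads 4 and 5 land on the same addresses as heads 0 and 1, so a sender in this model would have to wait for a credit before reusing a slot.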
@@ -33,6 +33,7 @@ class TensorHandle:
    dtype: str
    nbytes: int                      # total byte size
    data: object = None              # reserved for validate mode
+    space: str = "tcm"               # MemoryStore space ("tcm" | "hbm" | "sram")


@dataclass(frozen=True)