Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

Major changes:

PE-level IPCQ infrastructure:
- New PE_IPCQ component: ring-buffer control plane with 4-direction
  neighbor mapping, head/tail pointers, backpressure (poll/sleep).
- PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA,
  including in-flight data snapshot (D9) and op_log recording at
  outbound time for Phase 2 replay correctness.
- IpcqDmaToken piggyback model: data + metadata travel together,
  atomic visibility at receiver (invariant I6).
- Credit return fast path: bottleneck-BW latency, no fabric vc_comm.

Phase 2 data execution (ADR-0020 integration):
- op_log extended: DmaWriteCmd now captures src_space/src_addr for
  Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time.
- DataExecutor replays dma_write + ipcq_copy in t_start order.
- Engine._flush_data_phase: incremental cursor-based replay after each
  engine.wait() so host reads see post-Phase-2 data.
- KernelRunner Phase 1 writes disabled when op_log is active to prevent
  stale data from corrupting the MemoryStore snapshot.

TLContext / kernel API:
- tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype),
  tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode.
- TensorHandle operator overloading (add/sub/mul/div) via thread-local
  active TLContext → MathCmd dispatch through PE_MATH.
- PE-local scratch allocator for math output handles.
- tl.load returns space="hbm" handles for correct Phase 2 addressing.
- Additional math functions: maximum, minimum, fma, clamp, softmax,
  cdiv.

Unified ccl_allreduce bench (PyTorch-compat host code):
- Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch)
  split matching real PyTorch DDP worker pattern.
- torch.distributed facade: init_process_group, get_world_size,
  get_rank, get_backend, all_reduce, barrier; only real PyTorch names.
- AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches
  kernel via tensor shard metadata (n_elem from shards[0].nbytes).
- world_size derived from topology spec (sips × cubes × pes_per_cube)
  with optional algorithm-level override in ccl.yaml.

Tensor API (PyTorch-compat surface):
- Tensor.numpy(): gather-aware (all shards via VA-based addressing).
- Tensor.copy_(source): scatter from host tensor into sharded target.
- RuntimeContext.from_numpy(arr): host-side staging tensor.
- Tensor.data property fixed to use numpy() (was shards[0]-only).

Algorithm modules moved to src/kernbench/ccl/algorithms/:
- ring_allreduce, mesh_allreduce, tree_allreduce, hello_send.
- Each module exports kernel_args(world_size, n_elem) helper.
- ccl.yaml module paths updated to kernbench.ccl.algorithms.*.

Dead code removed:
- 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.).
- _run_ccl_bench greenlet-per-SIP scheduler.
- benches.loader.is_ccl_bench + run_rank detection.
- benches/ccl/ directory.

Tests:
- New test_ccl_allreduce_matrix.py: 7 parametrized cases (ring×3
  buffers, ring 8/16, mesh 4, tree 7).
- New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests.
- Existing tests updated for new import paths + world_size_override.

Docs:
- Korean ccl-author-guide.md and ADR-0023 paths updated.
- New English versions: ccl-author-guide.en.md, ADR-0023.en.md.

502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 19:36:59 -07:00
parent ff2c677a9c
commit 998cc85762
60 changed files with 9196 additions and 80 deletions
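
The in-flight snapshot (D9) mentioned in the commit message can be illustrated with a toy model. This is only a minimal sketch under stated assumptions, not the kernbench implementation: `Token`, `ToyStore`, `send`, and `deliver` are hypothetical names introduced here, and a flat bytearray stands in for the real MemoryStore. It shows why copying the source bytes at send time keeps the receiver from seeing data the sender wrote after issuing the send.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    """Hypothetical stand-in for IpcqDmaToken (illustration only)."""
    src_addr: int
    dst_addr: int
    nbytes: int
    data: Optional[bytes] = None  # in-flight snapshot, filled at send time


class ToyStore:
    """Flat byte-addressed memory; a stand-in for the real MemoryStore."""

    def __init__(self, size: int) -> None:
        self.mem = bytearray(size)

    def read(self, addr: int, nbytes: int) -> bytes:
        # bytes() copies, so the result is immune to later writes.
        return bytes(self.mem[addr:addr + nbytes])

    def write(self, addr: int, data: bytes) -> None:
        self.mem[addr:addr + len(data)] = data


def send(store: ToyStore, token: Token) -> Token:
    """Outbound side: snapshot the source bytes unless already attached."""
    if token.data is None:
        token.data = store.read(token.src_addr, token.nbytes)
    return token


def deliver(store: ToyStore, token: Token) -> None:
    """Inbound side: prefer the snapshot, else fall back to a fresh read."""
    data = token.data
    if data is None:
        data = store.read(token.src_addr, token.nbytes)
    store.write(token.dst_addr, data)


store = ToyStore(16)
store.write(0, b"\x01\x02\x03\x04")            # sender fills its buffer
tok = send(store, Token(src_addr=0, dst_addr=8, nbytes=4))
store.write(0, b"\xff\xff\xff\xff")            # sender overwrites after send
deliver(store, tok)                            # arrival happens later
result = store.read(8, 4)                      # receiver sees the send-time bytes
```

If `send` skipped the snapshot, `deliver` would take the fallback path and copy the overwritten bytes instead, which is the hazard the snapshot comment in the diff below describes.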
@@ -106,18 +106,131 @@ class PeDmaComponent(PeEngineBase):
        pe_txn.done.succeed()

    def _worker(self, env: simpy.Environment) -> Generator:
-        """Handle TileToken (pipeline), PeInternalTxn (legacy), and Transaction (fabric)."""
+        """Handle TileToken (pipeline), PeInternalTxn (legacy), IpcqDmaToken,
+        and Transaction (fabric)."""
+        from kernbench.common.ipcq_types import IpcqDmaToken
        from kernbench.common.pe_commands import PeInternalTxn
        from kernbench.components.builtin.pe_types import TileToken

        while True:
            msg: Any = yield self._inbox.get()
-            if isinstance(msg, TileToken):
+            if isinstance(msg, IpcqDmaToken):
+                # Outbound: IPCQ token from local PE_IPCQ → forward via fabric
+                env.process(self._handle_ipcq_outbound(env, msg))
+            elif isinstance(msg, TileToken):
                env.process(self._pipeline_process(env, msg))
            elif isinstance(msg, PeInternalTxn):
                env.process(self._handle_with_hooks(env, msg))
            else:
-                env.process(self._forward_txn(env, msg))
+                # Transaction (or unknown). May carry IpcqDmaToken inbound.
+                req = getattr(msg, "request", None)
+                if isinstance(req, IpcqDmaToken):
+                    env.process(self._handle_ipcq_inbound(env, msg))
+                else:
+                    env.process(self._forward_txn(env, msg))
+
+    # ── IPCQ outbound (PE_IPCQ → PE_DMA → fabric) ───────────────────
+
+    def _handle_ipcq_outbound(self, env: simpy.Environment, token: Any) -> Generator:
+        """Forward IpcqDmaToken from local PE_IPCQ through the fabric to peer
+        PE_DMA. ADR-0023 D8 (vc_comm channel)."""
+        if self.ctx is None:
+            return  # nothing to do
+        peer = token.dst_endpoint
+        peer_pe_dma = f"sip{peer.sip}.cube{peer.cube}.pe{peer.pe}.pe_dma"
+
+        # Snapshot the source data at send time (D9 in-flight semantics).
+        # Without this, the receiver could read stale or future data if the
+        # sender mutates src_addr between send issue and DMA arrival.
+        store = getattr(self.ctx, "memory_store", None)
+        if store is not None and token.data is None:
+            try:
+                snap = store.read(
+                    token.src_space, token.src_addr,
+                    shape=token.shape, dtype=token.dtype,
+                )
+                # Copy so later mutations to src_addr don't affect the snapshot.
+                token.data = snap.copy() if hasattr(snap, "copy") else snap
+            except Exception:
+                token.data = None
+
+        # Record the IPCQ copy in op_log at OUTBOUND time. ADR-0020 D6:
+        # Phase 2 replays the copy in t_start order; using outbound time
+        # (rather than inbound) ensures the copy executes before any later
+        # local op at the sender that might overwrite token.src_addr (e.g.
+        # a tl.store after a recv).
+        if self._op_logger is not None:
+            try:
+                self._op_logger.record_copy(
+                    t_start=float(env.now), t_end=float(env.now),
+                    component_id=self.node.id,
+                    src_space=token.src_space, src_addr=token.src_addr,
+                    dst_space=peer.buffer_kind,
+                    dst_addr=token.dst_addr,
+                    shape=token.shape, dtype=token.dtype, nbytes=token.nbytes,
+                )
+            except Exception:
+                pass
+
+        try:
+            path = self.ctx.router.find_path(self._pe_prefix, peer_pe_dma)
+        except Exception:
+            return
+        drain_ns = self.ctx.compute_drain_ns(path, token.nbytes)
+
+        sub_done = env.event()
+        sub_txn = Transaction(
+            request=token, path=path, step=0,
+            nbytes=token.nbytes, done=sub_done, drain_ns=drain_ns,
+        )
+        if len(path) > 1:
+            next_hop = path[1]
+            if next_hop in self.out_ports:
+                yield self.out_ports[next_hop].put(sub_txn.advance())
+            else:
+                return
+        # Note: don't wait on sub_done here — fire-and-forget for vc_comm.
+        # IPCQ slot bookkeeping (peer_head) was already updated by PE_IPCQ;
+        # backpressure is via credit return, not via this DMA's completion.
+
+    # ── IPCQ inbound (fabric → PE_DMA → MemoryStore + PE_IPCQ) ──────
+
+    def _handle_ipcq_inbound(self, env: simpy.Environment, txn: Any) -> Generator:
+        """At destination PE_DMA: atomically write data and forward metadata.
+
+        I6 (MUST): no SimPy yield between MemoryStore.write and the
+        IpcqMetaArrival put into PE_IPCQ.
+        """
+        from kernbench.common.ipcq_types import IpcqMetaArrival
+
+        token = txn.request
+
+        # ── ATOMIC: do not introduce yield between these two operations ──
+        # 1. Move data via MemoryStore (single-hop DMA write).
+        # Prefer the in-flight snapshot stashed by the sender PE_DMA;
+        # fall back to a fresh read of src_addr if no snapshot is present
+        # (e.g. control-only token).
+        store = getattr(self.ctx, "memory_store", None) if self.ctx else None
+        if store is not None:
+            try:
+                data = token.data
+                if data is None:
+                    data = store.read(
+                        token.src_space, token.src_addr,
+                        shape=token.shape, dtype=token.dtype,
+                    )
+                store.write(token.dst_endpoint.buffer_kind, token.dst_addr, data)
+            except Exception:
+                pass
+
+        # 2. Forward IpcqMetaArrival to local PE_IPCQ
+        ipcq_id = f"{self._pe_prefix}.pe_ipcq"
+        if ipcq_id in self.out_ports:
+            yield self.out_ports[ipcq_id].put(IpcqMetaArrival(token=token))
+        # ─────────────────────────────────────────────────────────────────
+
+        if not txn.done.triggered:
+            txn.done.succeed()

    def _pipeline_process(self, env: simpy.Environment, token: Any) -> Generator:
        """Pipeline mode: DMA read/write via fabric, then self-route."""