Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

Major changes:

PE-level IPCQ infrastructure:
- New PE_IPCQ component: ring-buffer control plane with 4-direction
  neighbor mapping, head/tail pointers, backpressure (poll/sleep).
- PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA,
  including in-flight data snapshot (D9) and op_log recording at
  outbound time for Phase 2 replay correctness.
- IpcqDmaToken piggyback model: data + metadata travel together,
  atomic visibility at receiver (invariant I6).
- Credit return fast path: bottleneck-BW latency, no fabric vc_comm.
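
For readers unfamiliar with the ring-buffer control plane mentioned above, the following is a minimal sketch of a head/tail ring queue with poll/sleep backpressure. It is not the PE_IPCQ implementation; `RingQueue`, `try_push`, `try_pop`, and `push_with_backpressure` are hypothetical names chosen for illustration.

```python
import time


class RingQueue:
    """Fixed-capacity ring buffer indexed by ever-increasing head/tail counters."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity
        self._capacity = capacity
        self._head = 0  # index of the next entry to read
        self._tail = 0  # index of the next entry to write

    def __len__(self) -> int:
        return self._tail - self._head

    def try_push(self, item) -> bool:
        """Write one entry; return False (backpressure) when the ring is full."""
        if len(self) == self._capacity:
            return False
        self._slots[self._tail % self._capacity] = item
        self._tail += 1
        return True

    def try_pop(self):
        """Return (True, item) or (False, None) when the ring is empty."""
        if self._head == self._tail:
            return False, None
        slot = self._head % self._capacity
        item, self._slots[slot] = self._slots[slot], None
        self._head += 1
        return True, item


def push_with_backpressure(queue, item, max_polls=3, sleep_s=0.0) -> bool:
    """Poll the queue, sleeping between attempts, until the push succeeds."""
    for _ in range(max_polls):
        if queue.try_push(item):
            return True
        time.sleep(sleep_s)
    return False
```

The sender never overwrites unread entries: a full ring simply refuses the push, and the caller decides how long to poll before giving up.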

Phase 2 data execution (ADR-0020 integration):
- op_log extended: DmaWriteCmd now captures src_space/src_addr for
  Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time.
- DataExecutor replays dma_write + ipcq_copy in t_start order.
- Engine._flush_data_phase: incremental cursor-based replay after
  each engine.wait() so host reads see post-Phase-2 data.
- KernelRunner Phase 1 writes disabled when op_log is active to
  prevent stale data from corrupting the MemoryStore snapshot.
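
The replay scheme above (ordered by t_start, advanced by a cursor) can be sketched as follows. This is an illustrative model only, not the DataExecutor or Engine code; `CopyOp`, `ReplayLog`, and the flat `bytearray` memory are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CopyOp:
    """One recorded copy: when it started and which bytes it moves."""
    t_start: float
    src: int
    dst: int
    nbytes: int


@dataclass
class ReplayLog:
    ops: list = field(default_factory=list)
    cursor: int = 0  # number of ops already replayed

    def record(self, op: CopyOp) -> None:
        self.ops.append(op)

    def flush(self, memory: bytearray) -> int:
        """Replay only the ops recorded since the last flush, in t_start order."""
        pending = sorted(self.ops[self.cursor:], key=lambda op: op.t_start)
        for op in pending:
            memory[op.dst:op.dst + op.nbytes] = memory[op.src:op.src + op.nbytes]
        self.cursor = len(self.ops)
        return len(pending)
```

Ordering by t_start matters when one copy consumes the output of another that was recorded later; the cursor keeps repeated flushes from replaying the same op twice.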

TLContext / kernel API:
- tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype),
  tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode.
- TensorHandle operator overloading (add/sub/mul/div) via thread-local
  active TLContext → MathCmd dispatch through PE_MATH.
- PE-local scratch allocator for math output handles.
- tl.load returns space="hbm" handles for correct Phase 2 addressing.
- Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv.
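
The operator-overloading bullet above describes a common pattern: arithmetic on a handle is routed to whichever context is active on the current thread. A minimal sketch of that pattern follows; `Handle`, `RecordingContext`, and `activate` are hypothetical stand-ins, not the TensorHandle or TLContext classes, and the recorded tuples only loosely model a MathCmd.

```python
import threading
from contextlib import contextmanager

_active = threading.local()  # holds the per-thread active context


class Handle:
    """Opaque named value; arithmetic is recorded, not computed."""

    def __init__(self, name: str) -> None:
        self.name = name

    def _dispatch(self, op: str, other: "Handle") -> "Handle":
        ctx = getattr(_active, "ctx", None)
        if ctx is None:
            raise RuntimeError("no active context on this thread")
        return ctx.emit(op, self, other)

    def __add__(self, other): return self._dispatch("add", other)
    def __sub__(self, other): return self._dispatch("sub", other)
    def __mul__(self, other): return self._dispatch("mul", other)
    def __truediv__(self, other): return self._dispatch("div", other)


class RecordingContext:
    """Collects (op, lhs, rhs, out) commands and hands out scratch handles."""

    def __init__(self) -> None:
        self.commands = []
        self._next_id = 0

    def emit(self, op: str, lhs: Handle, rhs: Handle) -> Handle:
        out = Handle(f"tmp{self._next_id}")
        self._next_id += 1
        self.commands.append((op, lhs.name, rhs.name, out.name))
        return out


@contextmanager
def activate(ctx: RecordingContext):
    """Make ctx the active context for this thread, restoring the old one after."""
    previous = getattr(_active, "ctx", None)
    _active.ctx = ctx
    try:
        yield ctx
    finally:
        _active.ctx = previous
```

Because the active context lives in thread-local storage, kernels running on different threads record into separate command lists without passing the context explicitly.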

Unified ccl_allreduce bench (PyTorch-compat host code):
- Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch)
  split matching the real PyTorch DDP worker pattern.
- torch.distributed facade: init_process_group, get_world_size,
  get_rank, get_backend, all_reduce, barrier (only real PyTorch names).
- AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches
  kernel via tensor shard metadata (n_elem from shards[0].nbytes).
- world_size derived from topology spec (sips × cubes × pes_per_cube)
  with optional algorithm-level override in ccl.yaml.
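
The world_size rule in the last bullet amounts to a product with an optional override. A small sketch, assuming a topology dict with the three keys named above; the function name and dict layout are hypothetical and do not reflect the actual spec schema:

```python
from typing import Optional


def derive_world_size(topology: dict, override: Optional[int] = None) -> int:
    """Return sips * cubes * pes_per_cube unless an explicit override is given."""
    if override is not None:
        if override <= 0:
            raise ValueError("world_size override must be positive")
        return override
    return topology["sips"] * topology["cubes"] * topology["pes_per_cube"]
```

For example, a topology of 2 SIPs, 2 cubes per SIP, and 4 PEs per cube yields a world size of 16 unless the algorithm entry overrides it.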

Tensor API (PyTorch-compat surface):
- Tensor.numpy(): gather-aware (all shards via VA-based addressing).
- Tensor.copy_(source): scatter from host tensor into sharded target.
- RuntimeContext.from_numpy(arr): host-side staging tensor.
- Tensor.data property fixed to use numpy() (was shards[0]-only).

Algorithm modules moved to src/kernbench/ccl/algorithms/:
- ring_allreduce, mesh_allreduce, tree_allreduce, hello_send.
- Each module exports kernel_args(world_size, n_elem) helper.
- ccl.yaml module paths updated to kernbench.ccl.algorithms.*.
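
As background for the ring_allreduce module named above, here is a host-side simulation of the textbook ring all-reduce (reduce-scatter followed by all-gather). It is a sketch of the general algorithm under the assumption of a sum reduction, not the kernel in this commit; `chunk_bounds` and `ring_allreduce` are illustrative helpers.

```python
def chunk_bounds(n_elem: int, world_size: int) -> list:
    """Split n_elem into world_size contiguous (lo, hi) chunks, as evenly as possible."""
    base, extra = divmod(n_elem, world_size)
    bounds, start = [], 0
    for i in range(world_size):
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def ring_allreduce(buffers: list) -> list:
    """Sum-reduce equal-length per-rank lists; every rank ends with the full sum."""
    world_size = len(buffers)
    data = [list(b) for b in buffers]
    bounds = chunk_bounds(len(data[0]), world_size)

    def exchange(first_chunk_offset: int, step: int, reduce: bool) -> None:
        # Snapshot every outgoing chunk first so all sends in a step are simultaneous.
        sends = []
        for rank in range(world_size):
            chunk = (rank + first_chunk_offset - step) % world_size
            lo, hi = bounds[chunk]
            sends.append((chunk, data[rank][lo:hi]))
        for rank, (chunk, payload) in enumerate(sends):
            dst = (rank + 1) % world_size
            lo, hi = bounds[chunk]
            if reduce:
                data[dst][lo:hi] = [a + b for a, b in zip(data[dst][lo:hi], payload)]
            else:
                data[dst][lo:hi] = payload

    for step in range(world_size - 1):  # reduce-scatter: accumulate partial sums
        exchange(0, step, reduce=True)
    for step in range(world_size - 1):  # all-gather: circulate the finished chunks
        exchange(1, step, reduce=False)
    return data
```

Each rank sends and receives one chunk per step, so the whole exchange takes 2 × (world_size − 1) steps regardless of buffer size.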

Dead code removed:
- 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.).
- _run_ccl_bench greenlet-per-SIP scheduler.
- benches.loader.is_ccl_bench + run_rank detection.
- benches/ccl/ directory.

Tests:
- New test_ccl_allreduce_matrix.py: 7 parametrized cases
  (ring×3 buffers, ring 8/16, mesh 4, tree 7).
- New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests.
- Existing tests updated for new import paths + world_size_override.

Docs:
- Korean ccl-author-guide.md and ADR-0023 paths updated.
- New English versions: ccl-author-guide.en.md, ADR-0023.en.md.

502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@@ -25,6 +25,7 @@ _PE_COMP_OFFSETS = {
     "pe_math": (0.0, 0.15),
     "pe_mmu": (0.15, -0.15),
     "pe_tcm": (0.3, 0.0),
+    "pe_ipcq": (-0.15, 0.15),
 }
@@ -698,6 +699,20 @@ def _add_pe_internal_edges(edges: list[Edge], pp: str, pe_links: dict) -> None:
                 kind="pe_internal",
             ))
 
+    # PE_IPCQ edges (ADR-0023 D1, D9 D10)
+    ipcq_edges = [
+        ("pe_cpu", "pe_ipcq", "cpu_to_ipcq_mm"),  # IpcqRequest
+        ("pe_ipcq", "pe_dma", "ipcq_to_dma_mm"),  # IpcqDmaToken outbound
+        ("pe_dma", "pe_ipcq", "dma_to_ipcq_mm"),  # IpcqMetaArrival inbound
+    ]
+    for src_c, dst_c, mm_key in ipcq_edges:
+        if mm_key in pe_links:
+            edges.append(Edge(
+                src=f"{pp}.{src_c}", dst=f"{pp}.{dst_c}",
+                distance_mm=pe_links[mm_key],
+                kind="pe_internal",
+            ))
+
 
 # ── Inter-cube / IO / system edges ──────────────────────────────────
@@ -765,7 +780,13 @@ def _add_io_to_cube_edges(
 def _add_system_to_io_edges(
     edges: list[Edge], sp: str, sip_spec: dict, system: dict,
 ) -> None:
-    """Add fabric switch → IO chiplet PCIe edges."""
+    """Add bidirectional fabric switch ↔ IO chiplet PCIe edges.
+
+    Both directions are needed:
+      switch → pcie_ep for host→device traffic (memory writes, kernel launch)
+      pcie_ep → switch for device-side outbound traffic (cross-SIP IPCQ
+      send between PE_DMAs through the system switch).
+    """
     sw_id = "fabric.switch0"
     sys_link = system["links"]["io_ep_to_switch"]
     for inst in sip_spec["iochiplet"]["instances"]:
@@ -776,6 +797,12 @@ def _add_system_to_io_edges(
             bw_gbs=sys_link["bw_gbs_per_ep"],
             kind="pcie",
         ))
+        edges.append(Edge(
+            src=pcie_ep_id, dst=sw_id,
+            distance_mm=sys_link["distance_mm"],
+            bw_gbs=sys_link["bw_gbs_per_ep"],
+            kind="pcie",
+        ))
 
 
 # ── View builders ────────────────────────────────────────────────────
@@ -1113,13 +1140,14 @@ def _build_pe_view(spec: dict) -> ViewGraph:
         "pe_math": (7.0, 6.5),
         "pe_mmu": (4.0, 1.5),
         "pe_tcm": (10.0, 4.0),
+        "pe_ipcq": (4.0, 6.5),
     }
 
     nodes: dict[str, Node] = {}
     view_edges: list[Edge] = []
 
     for comp_name, comp_spec in pe_tmpl["components"].items():
-        px, py = positions[comp_name]
+        px, py = positions.get(comp_name, (1.0, 1.0))
         nodes[comp_name] = Node(
             id=comp_name, kind=comp_spec["kind"], impl=comp_spec["impl"],
             attrs=comp_spec["attrs"], pos_mm=(px, py),
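
The switch ↔ pcie_ep change above appends the reverse edge by hand. One way to express the same idea is a small helper that emits both directions with identical attributes. This is a sketch only: the `Edge` dataclass below is a stand-in that mirrors the field names visible in the diff, and `add_bidirectional` is a hypothetical helper that does not exist in the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Stand-in for the project's Edge type (fields taken from the diff)."""
    src: str
    dst: str
    distance_mm: float
    bw_gbs: float
    kind: str


def add_bidirectional(edges: list, a: str, b: str, *,
                      distance_mm: float, bw_gbs: float, kind: str) -> None:
    """Append a→b and b→a edges that share the same physical attributes."""
    for src, dst in ((a, b), (b, a)):
        edges.append(Edge(src=src, dst=dst, distance_mm=distance_mm,
                          bw_gbs=bw_gbs, kind=kind))
```

With directed edges, a route from a PE_DMA out through the system switch only exists if the pcie_ep → switch direction is present, which is why the diff adds the second edge.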