998cc85762
Major changes:

PE-level IPCQ infrastructure:
- New PE_IPCQ component: ring-buffer control plane with 4-direction
  neighbor mapping, head/tail pointers, backpressure (poll/sleep).
- PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA,
  including in-flight data snapshot (D9) and op_log recording at
  outbound time for Phase 2 replay correctness.
- IpcqDmaToken piggyback model: data + metadata travel together,
  atomic visibility at receiver (invariant I6).
- Credit return fast path: bottleneck-BW latency, no fabric vc_comm.
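
The ring-buffer control plane described above can be pictured with a minimal sketch. It assumes a fixed-capacity queue with monotonically increasing head/tail counters and a non-blocking push; `RingQueue`, `try_push` and `pop` are hypothetical names for illustration, not the actual PE_IPCQ interface.

```python
from dataclasses import dataclass, field


@dataclass
class RingQueue:
    """Hypothetical fixed-capacity ring buffer with head/tail counters."""
    capacity: int
    head: int = 0  # next slot to read (advanced by the consumer)
    tail: int = 0  # next slot to write (advanced by the producer)
    slots: list = field(default_factory=list)

    def __post_init__(self):
        self.slots = [None] * self.capacity

    def is_full(self) -> bool:
        return self.tail - self.head == self.capacity

    def is_empty(self) -> bool:
        return self.tail == self.head

    def try_push(self, item) -> bool:
        """Return False when full; the caller applies backpressure (poll or sleep)."""
        if self.is_full():
            return False
        self.slots[self.tail % self.capacity] = item
        self.tail += 1
        return True

    def pop(self):
        if self.is_empty():
            raise IndexError("ring is empty")
        item = self.slots[self.head % self.capacity]
        self.head += 1  # frees one slot, comparable to returning a credit
        return item


q = RingQueue(capacity=2)
pushed = [q.try_push(x) for x in ("a", "b", "c")]  # third push meets a full ring
first = q.pop()                                    # consumer frees one slot
retry = q.try_push("c")                            # the retried push now fits
```

In this sketch the producer never blocks inside the queue; whether it polls or sleeps before retrying is left to the caller.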

Phase 2 data execution (ADR-0020 integration):
- op_log extended: DmaWriteCmd now captures src_space/src_addr for
  Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time.
- DataExecutor replays dma_write + ipcq_copy in t_start order.
- Engine._flush_data_phase: incremental cursor-based replay after
  each engine.wait() so host reads see post-Phase-2 data.
- KernelRunner Phase 1 writes disabled when op_log is active to
  prevent stale data from corrupting the MemoryStore snapshot.
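
As a rough illustration of the replay order and cursor described above, the sketch below assumes an op log of copy records and a memory store keyed by space name. `CopyOp`, `Replayer` and `flush` are hypothetical stand-ins, not the real DataExecutor or Engine._flush_data_phase.

```python
from dataclasses import dataclass


@dataclass
class CopyOp:
    t_start: float  # simulated start time recorded during Phase 1
    src: tuple      # (space, addr)
    dst: tuple      # (space, addr)
    nbytes: int


class Replayer:
    """Hypothetical incremental replayer: each flush handles only new ops."""

    def __init__(self, store):
        self.store = store  # space name -> bytearray
        self.log = []
        self.cursor = 0     # index of the first op not yet replayed

    def record(self, op):
        self.log.append(op)

    def flush(self):
        # Replay only ops recorded since the last flush, ordered by t_start.
        pending = sorted(self.log[self.cursor:], key=lambda op: op.t_start)
        for op in pending:
            (s_space, s_addr), (d_space, d_addr) = op.src, op.dst
            data = bytes(self.store[s_space][s_addr:s_addr + op.nbytes])
            self.store[d_space][d_addr:d_addr + op.nbytes] = data
        self.cursor = len(self.log)
        return len(pending)


store = {"hbm": bytearray(b"ABCD\x00\x00\x00\x00"), "tcm": bytearray(4)}
r = Replayer(store)
# Recorded out of order: the later op depends on the earlier one's result.
r.record(CopyOp(t_start=2.0, src=("tcm", 0), dst=("hbm", 4), nbytes=2))
r.record(CopyOp(t_start=1.0, src=("hbm", 0), dst=("tcm", 0), nbytes=2))
n_first = r.flush()   # replays both ops in t_start order
n_second = r.flush()  # nothing new since the cursor advanced
```

Replaying in recording order would copy zeros from `tcm` into `hbm` before `tcm` is filled; sorting by `t_start` avoids that in this toy case.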

TLContext / kernel API:
- tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype),
  tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode.
- TensorHandle operator overloading (add/sub/mul/div) via thread-local
  active TLContext → MathCmd dispatch through PE_MATH.
- PE-local scratch allocator for math output handles.
- tl.load returns space="hbm" handles for correct Phase 2 addressing.
- Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv.
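
The operator-overloading route (handle operators looking up a thread-local active context and emitting a math command) might look roughly like the sketch below. `Context`, `Handle` and the tuple-shaped commands are hypothetical; the real TLContext, TensorHandle and MathCmd types are not shown in this commit message.

```python
import threading

_active = threading.local()  # holds the context active on the current thread


class Context:
    """Hypothetical stand-in for a kernel context that collects math commands."""

    def __init__(self):
        self.cmds = []

    def __enter__(self):
        _active.ctx = self
        return self

    def __exit__(self, *exc):
        _active.ctx = None
        return False

    def emit(self, op, a, b):
        # Allocate a fresh output handle and record the command.
        out = Handle(f"tmp{len(self.cmds)}")
        self.cmds.append((op, a.name, b.name, out.name))
        return out


class Handle:
    """Hypothetical tensor handle whose operators dispatch to the active context."""

    def __init__(self, name):
        self.name = name

    def _dispatch(self, op, other):
        ctx = getattr(_active, "ctx", None)
        if ctx is None:
            raise RuntimeError("no active context on this thread")
        return ctx.emit(op, self, other)

    def __add__(self, other):
        return self._dispatch("add", other)

    def __sub__(self, other):
        return self._dispatch("sub", other)

    def __mul__(self, other):
        return self._dispatch("mul", other)

    def __truediv__(self, other):
        return self._dispatch("div", other)


with Context() as ctx:
    x, y = Handle("x"), Handle("y")
    z = (x + y) * x  # records an add followed by a mul
```

Keeping the active context in `threading.local()` lets kernel code write plain arithmetic without passing the context to every operator.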

Unified ccl_allreduce bench (PyTorch-compat host code):
- Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch)
  split matching real PyTorch DDP worker pattern.
- torch.distributed facade: init_process_group, get_world_size,
  get_rank, get_backend, all_reduce, barrier (only real PyTorch names).
- AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches
  kernel via tensor shard metadata (n_elem from shards[0].nbytes).
- world_size derived from topology spec (sips × cubes × pes_per_cube)
  with optional algorithm-level override in ccl.yaml.
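
The world_size derivation can be worked through with the values in the topology file shown on this page. This is a sketch under assumptions: it takes "cubes" to mean cubes per SIP (cube_mesh w × h) and pes_per_cube to be corners × pe_per_corner; `derive_world_size` and its `override` parameter are hypothetical names.

```python
def derive_world_size(sips, cube_mesh, corners, pe_per_corner, override=None):
    """Hypothetical helper: sips x cubes x pes_per_cube, unless overridden."""
    if override is not None:
        return override
    cubes = cube_mesh["w"] * cube_mesh["h"]      # assumed: cubes per SIP
    pes_per_cube = len(corners) * pe_per_corner  # 4 corners x 2 PEs each
    return sips * cubes * pes_per_cube


# Values copied from the topology YAML: sips.count, sip.cube_mesh, cube.pe_layout.
full = derive_world_size(2, {"w": 4, "h": 4}, ["NW", "NE", "SW", "SE"], 2)
small = derive_world_size(2, {"w": 4, "h": 4}, ["NW", "NE", "SW", "SE"], 2, override=8)
```

Under these assumptions the full topology gives 2 × 16 × 8 ranks, while an override such as 8 lets an algorithm run on a smaller group.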

Tensor API (PyTorch-compat surface):
- Tensor.numpy(): gather-aware (all shards via VA-based addressing).
- Tensor.copy_(source): scatter from host tensor into sharded target.
- RuntimeContext.from_numpy(arr): host-side staging tensor.
- Tensor.data property fixed to use numpy() (was shards[0]-only).

Algorithm modules moved to src/kernbench/ccl/algorithms/:
- ring_allreduce, mesh_allreduce, tree_allreduce, hello_send.
- Each module exports kernel_args(world_size, n_elem) helper.
- ccl.yaml module paths updated to kernbench.ccl.algorithms.*.

Dead code removed:
- 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.).
- _run_ccl_bench greenlet-per-SIP scheduler.
- benches.loader.is_ccl_bench + run_rank detection.
- benches/ccl/ directory.

Tests:
- New test_ccl_allreduce_matrix.py: 7 parametrized cases
  (ring×3 buffers, ring 8/16, mesh 4, tree 7).
- New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests.
- Existing tests updated for new import paths + world_size_override.

Docs:
- Korean ccl-author-guide.md and ADR-0023 paths updated.
- New English versions: ccl-author-guide.en.md, ADR-0023.en.md.

502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

138 lines
5.9 KiB
YAML

system:
  ns_per_mm: 0.01  # wire propagation delay: 10 ps/mm (on-chip silicon)

  sips:
    count: 2

  components:
    switch: { kind: switch, impl: builtin.switch, attrs: { overhead_ns: 5.0 } }

  links:
    io_ep_to_switch:
      kind: pcie
      bw_gbs_per_ep: 768.0
      distance_mm: 20.0

sip:
  cube_mesh: { w: 4, h: 4 }

  iochiplet:
    components:
      pcie_ep: { kind: pcie_ep, impl: builtin.pcie_ep, attrs: { overhead_ns: 5.0 } }
      io_cpu: { kind: io_cpu, impl: builtin.io_cpu, attrs: { overhead_ns: 10.0 } }
      io_noc: { kind: io_noc, impl: builtin.forwarding, attrs: { overhead_ns: 0.0 } }
    links:
      pcie_ep_to_noc_bw_gbs: 256.0
      pcie_ep_to_noc_mm: 1.0
      io_cpu_to_noc_bw_gbs: 256.0
      io_cpu_to_noc_mm: 0.5
    ucie:
      overhead_ns: 8.0
      n_connections: 4
      per_connection_bw_gbs: 128.0  # 4 × 128 = 512 GB/s = PHY BW
      noc_to_ucie_mm: 0.5
    instances:
      - id: io0
        place: { side: N, offset_norm: 0.5 }
        ucie: { phy_bw_gbs: 512.0, phys: [P0, P1, P2, P3] }
        cube_ports:
          - { cube: {xy: [0,0]}, cube_side: N, phy: P0, distance_mm: 2.0 }
          - { cube: {xy: [1,0]}, cube_side: N, phy: P1, distance_mm: 2.0 }
          - { cube: {xy: [2,0]}, cube_side: N, phy: P2, distance_mm: 2.0 }
          - { cube: {xy: [3,0]}, cube_side: N, phy: P3, distance_mm: 2.0 }

  links:
    inter_cube_mesh:
      bw_gbs_per_ucie_phy: 512.0
      distance_mm_across_seam: 1.0
  routing: { algo: xy }

cube:
  geometry:
    cube_mm: { w: 17.0, h: 14.0 }
    hbm_mm: { w: 9.0, h: 5.0 }
    ucie_mm: { size: 2.0 }

  pe_layout:
    corners: [NW, NE, SW, SE]  # N corners → top PE rows; S corners → bottom PE rows
    pe_per_corner: 2  # total PEs per cube: 4 * 2 = 8

  pe_template:
    components:
      pe_cpu: { kind: pe_cpu, impl: builtin.pe_cpu, attrs: { overhead_ns: 2.0 } }
      pe_scheduler: { kind: pe_scheduler, impl: builtin.pe_scheduler, attrs: { overhead_ns: 1.0 } }
      pe_dma: { kind: pe_dma, impl: builtin.pe_dma, attrs: { rd_engines: 1, wr_engines: 1 } }
      pe_gemm: { kind: pe_gemm, impl: builtin.pe_gemm, attrs: { overhead_ns: 0.0, shared_resource: accel_slot, peak_tflops_f16: 8.0 } }
      pe_math: { kind: pe_math, impl: builtin.pe_math, attrs: { overhead_ns: 0.0, shared_resource: accel_slot } }
      pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { overhead_ns: 0.0 } }
      pe_mmu: { kind: pe_mmu, impl: builtin.pe_mmu, attrs: { tlb_overhead_ns: 0.5, page_size: 4096 } }
      pe_tcm: { kind: pe_tcm, impl: builtin.pe_tcm, attrs: { size_mb: 16, read_bw_gbs: 512.0, write_bw_gbs: 512.0, kernel_scratch_mb: 1 } }
      pe_ipcq: { kind: pe_ipcq, impl: builtin.pe_ipcq, attrs: { overhead_ns: 0.0 } }
    links:
      pe_cpu_to_scheduler_mm: 0.5
      scheduler_to_dma_mm: 0.5
      scheduler_to_gemm_mm: 0.5
      scheduler_to_math_mm: 0.5
      scheduler_to_fetch_store_mm: 0.5
      dma_to_tcm_bw_gbs: 512.0
      dma_to_tcm_mm: 0.5
      dma_to_fetch_store_mm: 0.0  # DMA → fetch_store chaining (ADR-0021)
      fetch_store_to_tcm_bw_gbs: 512.0
      fetch_store_to_tcm_mm: 0.0
      fetch_store_to_gemm_mm: 0.0  # fetch → GEMM chaining (ADR-0021)
      fetch_store_to_math_mm: 0.0  # fetch → MATH chaining (ADR-0021)
      gemm_to_fetch_store_mm: 0.0  # GEMM → store chaining (ADR-0021)
      math_to_fetch_store_mm: 0.0  # MATH → store chaining (ADR-0021)
      fetch_store_to_dma_mm: 0.0  # store → DMA writeback chaining (ADR-0021)
      gemm_to_tcm_bw_gbs: 512.0
      gemm_to_tcm_mm: 0.5
      math_to_tcm_bw_gbs: 512.0
      math_to_tcm_mm: 0.5
      cpu_to_ipcq_mm: 0.5  # PE_CPU → PE_IPCQ (ADR-0023)
      ipcq_to_dma_mm: 0.0  # PE_IPCQ → PE_DMA token forwarding (ADR-0023)
      dma_to_ipcq_mm: 0.0  # PE_DMA → PE_IPCQ metadata arrival (ADR-0023)

  memory_map:
    hbm_total_gb_per_cube: 48
    hbm_slices_per_cube: 8
    hbm_total_bw_gbs: 1024.0
    hbm_mapping_mode: n_to_one  # one_to_one | n_to_one (ADR-0019)
    hbm_pseudo_channels: 64  # total pseudo channels per cube
    hbm_channels_per_pe: 8  # = pseudo_channels / pes_per_cube
    hbm_channel_bw_gbs: 32.0  # per-channel bandwidth (GB/s)

  components:
    noc_router: { kind: noc_router, impl: builtin.forwarding, attrs: { overhead_ns: 2.0 } }
    m_cpu: { kind: m_cpu, impl: builtin.m_cpu, attrs: { overhead_ns: 5.0 } }
    hbm_ctrl: { kind: hbm_ctrl, impl: builtin.hbm_ctrl, attrs: { capacity: 1, efficiency: 1.0 } }
    sram: { kind: sram, impl: builtin.sram, attrs: { size_mb: 32, overhead_ns: 2.0 } }

  # Physical placement of non-PE components (mm coordinates)
  placement:
    m_cpu: { pos_mm: [7.5, 3.0] }  # top center, below UCIe-N
    sram: { pos_mm: [1.5, 9.0] }  # left side, below HBM zone

  ucie:
    decompose: true
    ports: [N, S, E, W]
    overhead_ns: 8.0
    n_connections: 4  # independent NOC↔UCIe connections per port
    per_connection_bw_gbs: 128.0  # BW per connection; 4 × 128 = 512 GB/s = UCIe PHY BW

  links:
    # Router mesh links (ADR-0019)
    router_link_bw_gbs: 256.0  # inter-router XY mesh link BW
    router_overhead_ns: 2.0  # per-router switching overhead
    pe_to_router_bw_gbs: 256.0  # PE_DMA ↔ router (= N × channel_bw)
    hbm_to_router_bw_gbs: 256.0  # HBM_CTRL ↔ router (= N × channel_bw)
    sram_to_router_bw_gbs: 128.0  # SRAM ↔ router
    m_cpu_to_router_mm: 0.0  # M_CPU ↔ router distance
    pe_dma_to_noc_bw_gbs: 256.0  # PE → router BW (= HBM slice BW, no bottleneck)
    noc_to_pe_cpu_mm: 0.0  # router → PE_CPU distance (command path)

visualization:
  emit_views: [system, sip, cube]
  sip_ids: [0]
  cubes: [0, 9, 15]
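
The inline comments in this file state a few arithmetic relationships between fields. The sketch below re-derives them from literal values copied from the file. It assumes that the "N" in "N × channel_bw" refers to hbm_channels_per_pe and that wire delay is simply ns_per_mm × distance; the variable names are illustrative and not part of any loader API.

```python
# Literal values copied from the topology file above.
ns_per_mm = 0.01
io_ep_to_switch_mm = 20.0
n_connections = 4
per_connection_bw_gbs = 128.0
phy_bw_gbs = 512.0
corners = ["NW", "NE", "SW", "SE"]
pe_per_corner = 2
hbm_pseudo_channels = 64
hbm_channel_bw_gbs = 32.0
pe_to_router_bw_gbs = 256.0

# UCIe: connections x per-connection bandwidth should equal the PHY bandwidth.
ucie_total_bw_gbs = n_connections * per_connection_bw_gbs

# HBM: pseudo channels are split evenly across the PEs of a cube.
pes_per_cube = len(corners) * pe_per_corner
channels_per_pe = hbm_pseudo_channels // pes_per_cube

# Assumed reading of "N x channel_bw": N is the per-PE channel count.
pe_hbm_bw_gbs = channels_per_pe * hbm_channel_bw_gbs

# Assumed wire-delay model: propagation delay scales linearly with distance.
io_ep_to_switch_wire_ns = ns_per_mm * io_ep_to_switch_mm

print(ucie_total_bw_gbs, pes_per_cube, channels_per_pe, pe_hbm_bw_gbs)
```

A check like this is cheap to run when editing bandwidth fields, since the comments and the values can otherwise drift apart.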