Remove rack_id (4 bits), rename sip_seg→die_id, shift fields to enable
42-bit local_offset (4 TB per die). Define PE_LOCAL/MCPU_LOCAL/CUBE_SRAM
sub-unit tables for AHBM dies and IOCPU sub-unit table for IOCHIPLET
dies (1 TB window). Supersedes ADR-0031.
Also fixes latent VA/PA confusion in pe_dma pipeline DMA path where
virtual addresses were decoded as physical addresses without MMU
translation — previously masked by coincidental bit-position alignment.
529 passed (+6 recovered), 10 pre-existing failures unchanged.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- KernelLaunchMsg gains target_start_ns: IO_CPU stamps a global barrier
(max path latency across every target PE), M_CPU passes it through,
PE_CPU yields until it before recording pe_exec_start. Every PE in a
launch begins kernel execution at the same env.now regardless of its
dispatch path length — eliminates per-PE dispatch-offset artifact in
cross-PE and cross-cube latency measurements.
- PE_DMA._handle_ipcq_inbound now pays Transaction.drain_ns at the top,
matching the terminal-drain behavior of ComponentBase._forward_txn for
every non-IPCQ Transaction. SRC-side tl.send stays fire-and-forget
(sender doesn't yield on sub_done); tl.recv now blocks until bytes
have actually drained into its inbox.
- ComponentContext: new compute_path_latency_ns helper + node_overhead_ns
field populated by GraphEngine.
- tests/test_kernel_launch_sync.py: asserts all PEs in one launch
produce identical pe_exec_ns for a no-op kernel (zero spread).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements ADR-0027 Phase 2 end-to-end. All 559 tests pass (was 523 +
1 xfail; ring_default_ws strict-xfail is now resolved).
D0 — Worker-wait generalization (context.py):
- _pending_worker_waits queue on RuntimeContext.
- ctx.wait(h) in worker context defers to main via g.parent.switch().
Fast-path for already-completed handles.
- Worker API is unchanged: tensor deploy, launch, etc. still look
synchronous; they're transparently cooperatively scheduled.
- Solves ADR-0024 Phase B kernel-greenlet orphan bug (env.run now
only ever drives from main; kernel _parent is always main).
D0.5 — Host-read barrier (tensor.py):
- Explicit _HOST_READ_BARRIERS registry (T5.g closed-set via code
review, not reflection-magic).
- numpy/data/__getitem__/__repr__ drain pending worker-waits before
host-observable read.
- copy_: source-side barrier via source.numpy(). Target-side write
barrier is intentionally NOT applied — global pending target barrier
prematurely drains cross-rank collectives → deadlock.
- Collective pending is excluded from barrier drain condition
(collective is cross-rank; its own yield in all_reduce covers the
invariant naturally).
D1 — torch.multiprocessing.spawn (runtime_api/multiprocessing.py):
- API signature parity with real PyTorch spawn; execution is
cooperative greenlet scheduler (process isolation etc. are explicit
non-goals per D1.0).
- _drain_pending drains worker-waits then collectives in one barrier,
loop-until-empty.
- Round-based exception handling with SystemExit sibling abort +
SpawnException(errors) wrapping root-cause ranks.
- RuntimeContext attaches ctx.multiprocessing in __post_init__.
- benches/ccl_allreduce.py hand-rolled loop collapses to one
torch.multiprocessing.spawn call.
D2–D6 — kernbench.tp package:
- parallel_state: initialize_model_parallel, get_*_rank,
get_*_world_size, with weak active-ctx registry in context.py.
- layers: ColumnParallelLinear, RowParallelLinear (shape-only
primitives — fp16 gemm via tl.load + tl.dot + tl.store).
- kernels: _gemm_kernel used by TP layers (self-contained; no bench
dependency).
- primitives / mappings stubs per D6/D8.
Data-path fixes (surfaced by TP gemm + all_reduce sequence):
- sim_engine/op_log.py: dma_write snapshot is skipped for TCM
sources (PE scratch is repopulated by Phase 2 math/gemm replay —
capturing Phase-1-time snapshot picked up STALE data from prior
kernel's output aliased at the same scratch addr, causing the later
kernel's dma_write to overwrite Phase 2 result with stale value).
- sim_engine/op_log.py + sim_engine/data_executor.py: per-operand
space recorded on GemmCmd and composite gemm records so HBM-resident
operands (tl.load output) don't default to TCM during replay.
- runtime_api/context.py: ctx.zeros writes zero-init to MemoryStore
at VA keys so kernels reading via VA see deterministic init even
without explicit copy_().
Tests (Phase 1 + Phase 2):
- test_worker_wait_drain (T3): orphan invariant + resume + multi-rank
drain + idempotency + exception propagation.
- test_mp_spawn (T4): spawn shape + bind + SpawnException scope.
- test_host_read_barrier (T5): barrier contract per entry-point +
closed-set registry check.
- test_tp_parallel_state (T1): initialize + rank lookup.
- test_tp_layers (T2): shape + deterministic numerical correctness
(concat-matmul equality for RowParallel, not mean-only).
- test_tp_mlp (T6): full 2-layer MLP with deterministic weight
numerical match + rank-consistency post all-reduce.
- test_ccl_allreduce_matrix: ring_default_ws xfail removed (T7).
Regression: 523 pre + 35 new + 1 ex-xfail = 559 passed, 1 intentional
skip (T3.e historical failure documentation).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause (hang diagnosis):
`kernel_runner.run()` captures `greenlet.getcurrent()` at spawn time as
the kernel greenlet's `_parent`. When a worker greenlet (say g0) calls
`dist.all_reduce` → `ctx.wait(h)` → `env.run(until=h0)`, the SimPy
scheduler steps pe_cpu processes, which in turn spawn kernel greenlets.
Those kernels' `_parent` becomes g0 (current greenlet at spawn). When a
kernel yields via switch_to_simpy, control jumps back up to g0's LAST
switch point — which is the main scheduler's `g.switch()` call — rather
than the kernel_runner's generator frame. Main then re-enters its
`for g in alive: g.switch()` loop mid-wait, producing nested greenlet
re-entry. Scheduler spins: g0 never completes, g1 appears to complete
out of order, infinite loop at 100% CPU.
Fix:
- AhbmCCLBackend.all_reduce: in multi-greenlet mode, submit via
launch(_defer_wait=True), extend backend._pending_collective_handles,
and yield to the parent greenlet. Worker does NOT call wait.
- benches/ccl_allreduce.py run(): after each scheduler round, the MAIN
greenlet drains backend._pending_collective_handles. This keeps
env.run invocation in the main context, so kernel_runner's spawned
kernel greenlets have main as their _parent — no nested re-entry.
- Legacy single-driver path (no bench scheduler): all_reduce falls back
to inline wait when g.parent is None.
Result:
- Multi-greenlet cross-SIP ring no longer hangs (was 100% CPU infinite
loop in kernel_runner._switch_kernel).
- ring_default_ws still xfail(strict=True): now fails as a data
correctness issue — DataExecutor reports only 1 math op for a 2-rank
ring (expected 2). Cross-SIP op_log replay integration is the
remaining Phase B task.
514 passed, 1 xfailed (strict).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tensor.__setitem__ / __getitem__:
- Shard-aligned slice assignment and read on deployed tensors.
- Scalar broadcast and numpy array assignment supported.
- Cross-shard slices raise NotImplementedError (use copy_ for that).
- 3 new tests: single-PE, multi-PE, cross-shard error case.
Hierarchical all-reduce kernel (src/kernbench/ccl/algorithms/):
- 3-level reduce: intra-cube (E/W) → inter-cube (N/S) → inter-SIP (parent).
- Bidirectional ring reduce at each level: ceil((N-1)/2) rounds.
Left half sends via dir_dec, right half via dir_inc (wrap).
Representative receives from both sides.
- Chain broadcast for reverse path: cube 0 PE 0 → all PE 0s → all PEs.
- Registered in ccl.yaml as "hierarchical_allreduce" with topology: none
(neighbors() override builds the full 3-level neighbor map).
- kernel_args derives pes_per_cube/cubes_per_sip/num_sips from world_size.
- Mock-verified at 8/16/32/64/128 ranks.
Mock runtime fixes:
- Direction pairing: explicit N↔S, E↔W, parent↔parent instead of
"first matching reverse". Fixes 2-element rings where N and S both
point to the same peer.
- Deadlock detection: send-counter based (not just queue-depth-total)
to catch chain reductions where send+recv pairs net to zero.
- Multi-cube program_id: pes_per_cube parameter enables
program_id(axis=0) = PE within cube, program_id(axis=1) = cube id.
Legacy single-cube tests unaffected (default = world_size).
504 tests pass in 12s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- CLI: --verify-data flag enables Phase 2 data verification (ADR-0020)
- Tensor.data: returns actual numpy values (verify-data) or zeros placeholder
- Tensor.__repr__: shows value summary or data=N/A (placeholder)
- DataExecutor: ThreadPoolExecutor for same-timestamp parallel op execution
- BenchResult.engine: exposes op_log/memory_store for Phase 2 access
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RuntimeContext._ensure_allocators() now limits SIP range to
target_device (single SIP or all). Prevents cross-SIP tensor
deployment that caused PE_TCM routing errors.
Also accept 'sip0' format (without colon) in DeviceSelector.
331 passed, 8 skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add cycle-accurate PE accelerator scheduler (SchedulerV2) with tiled
GEMM/Math pipelines (DMA_IN → GEMM → MATH → DMA_WB)
- Add DPPolicy num_pes/num_cubes/num_sips overrides for single-PE testing
- Support tuple target_pe for targeting specific PE subsets
- Add gemm_single_pe and gpt3_qkv benchmarks
- Switch default topology to pe_scheduler_v2
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>