Extend tl.composite() with an ordered epilogue list. Each op carries
a scope flag - output_tile (default, runs once per (m,n) before
STORE), k_tile (every K-tile right after GEMM), or kernel. Plan
generator slots MATH stages by scope; pe_math reuses pe_dma's
local-loop pattern so chained epilogues (bias->relu) skip the port
hop. op_log captures per-stage params for telemetry. Topology
gains a gemm->math edge (snapshot test updated).
API stays backward-compatible - `epilogue=` is opt-in.
Example:
h = tl.composite(
op="gemm", a=a, b=b, out_ptr=int(out),
epilogue=[
{"op": "dequant", "scale": s_per_k, "scope": "k_tile"},
{"op": "bias", "bias": bias_vec},
{"op": "relu"},
{"op": "scale", "factor": 0.5},
],
)
tl.wait(h)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces global round-robin with deterministic address-derived PC
striping:
pc_shift = log2(burst_bytes)
pc_mask = num_pcs - 1
pc = (flit.address >> pc_shift) & pc_mask
Each Transaction carries base_address (HBM byte offset of the first
chunk); each Flit derives its own address as base + i*flit_bytes.
HBM CTRL routes flits to PCs via this formula, replacing the
arrival-order RR pointer. Also splits the is_last wait into an
asynchronous _finalize_txn process so the worker isn't blocked on
PC commit, exposing true PC parallelism for disjoint addresses.
phyaddr.py documents the canonical bit layout (bits [10:8] for the
default burst=256, num_pcs=8 case). ADR-0033 D6 records the
derivation and the workload scenarios where address-striping
matters (strided streams, offset-disjoint parallel transfers).
Adds tests/test_hbm_address_based_pc.py: canonical bit mapping,
strided 8-way load distribution, same-address PC-0 serialization,
PC-aligned 2KB pair collision, dynamic pc_shift from burst_bytes,
and power-of-2 attr validation. Integration tests inspect
_pc_avail ledger directly: at default config UCIe's 8 ns per-txn
overhead exactly matches chunk_time, masking PC contention at the
makespan level even though the ledger correctly distinguishes the
cases.
Full suite: 631 passed, 1 skipped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of Phase 2c-1 timing collapse identified: src.out_port and
dst.in_port aliased the same simpy.Store, so when wire chunkified a
Transaction into Flits and re-put them, fan_in could pull flits before
the wire applied bw delay — half the flits bypassed bottleneck timing.
Fix: separate Stores per directed edge. Wire is the only conduit. Each
flit on the wire incurs chunk_time = flit_nbytes/bw_gbs once, in arrival
order. Multi-hop wormhole pipelining emerges naturally because
flit-aware pass-through (TransitComponent) forwards each flit serially
without reassembly.
64 KB MemoryWrite via UCIe 128 GB/s bottleneck: 273 ns (broken) → 545 ns
(matches drain 512 + commit 8 + path overheads). 1 MB: 8230 ns (matches
drain 8192). Single-flit transfer transport-time alone, exactly what
real-HW wormhole produces.
3 pre-existing tests now off by small margins or inverted:
- test_h2d_local_cube_cut_through: 65.53 vs threshold 65.0
- test_engine_override_is_scoped_to_impl: ZeroRouter inherits
ComponentBase, not flit-aware, so override path reassembles at each
hop while default doesn't
- test_intra_sip_critical_path_at_96k_below_threshold: 96KB allreduce
microscopically over its threshold
Not weakening these to pass: they reflect model fidelity improvements
that need calibrated thresholds. To address in follow-up via test
threshold updates and ZeroRouter→TransitComponent inheritance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire decomposes Transactions into Flits per `_flit_bytes` but emits all
flits atomically at the same env.now — preserves single-msg timing as
infrastructure for Phase 2c-2 (per-flit timing + flit-aware routers).
Non-flit-aware components reassemble Flits in `_fan_in`; `_update_step`
sets txn.step to current component's path position so legacy
step-based routing continues working when upstream is flit-aware.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous model double-counted slow-upstream paths (e.g., 64KB via UCIe
128 GB/s was ~2x pessimistic). HBM CTRL now distributes bursts across
8 pseudo-channels via global round-robin, with per-chunk commit timing
that pipelines correctly against the bottleneck link's data arrival.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove rack_id (4 bits), rename sip_seg→die_id, shift fields to enable
42-bit local_offset (4 TB per die). Define PE_LOCAL/MCPU_LOCAL/CUBE_SRAM
sub-unit tables for AHBM dies and IOCPU sub-unit table for IOCHIPLET
dies (1 TB window). Supersedes ADR-0031.
Also fixes latent VA/PA confusion in pe_dma pipeline DMA path where
virtual addresses were decoded as physical addresses without MMU
translation — previously masked by coincidental bit-position alignment.
529 passed (+6 recovered), 10 pre-existing failures unchanged.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- KernelLaunchMsg gains target_start_ns: IO_CPU stamps a global barrier
(max path latency across every target PE), M_CPU passes it through,
PE_CPU yields until it before recording pe_exec_start. Every PE in a
launch begins kernel execution at the same env.now regardless of its
dispatch path length — eliminates per-PE dispatch-offset artifact in
cross-PE and cross-cube latency measurements.
- PE_DMA._handle_ipcq_inbound now pays Transaction.drain_ns at the top,
matching the terminal-drain behavior of ComponentBase._forward_txn for
every non-IPCQ Transaction. SRC-side tl.send stays fire-and-forget
(sender doesn't yield on sub_done); tl.recv now blocks until bytes
have actually drained into its inbox.
- ComponentContext: new compute_path_latency_ns helper + node_overhead_ns
field populated by GraphEngine.
- tests/test_kernel_launch_sync.py: asserts all PEs in one launch
produce identical pe_exec_ns for a no-op kernel (zero spread).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements ADR-0027 Phase 2 end-to-end. All 559 tests pass (was 523 +
1 xfail; ring_default_ws strict-xfail is now resolved).
D0 — Worker-wait generalization (context.py):
- _pending_worker_waits queue on RuntimeContext.
- ctx.wait(h) in worker context defers to main via g.parent.switch().
Fast-path for already-completed handles.
- Worker API is unchanged: tensor deploy, launch, etc. still look
synchronous; they're transparently cooperatively scheduled.
- Solves ADR-0024 Phase B kernel-greenlet orphan bug (env.run now
only ever drives from main; kernel _parent is always main).
D0.5 — Host-read barrier (tensor.py):
- Explicit _HOST_READ_BARRIERS registry (T5.g closed-set via code
review, not reflection-magic).
- numpy/data/__getitem__/__repr__ drain pending worker-waits before
host-observable read.
- copy_: source-side barrier via source.numpy(). Target-side write
barrier is intentionally NOT applied — global pending target barrier
prematurely drains cross-rank collectives → deadlock.
- Collective pending is excluded from barrier drain condition
(collective is cross-rank; its own yield in all_reduce covers the
invariant naturally).
D1 — torch.multiprocessing.spawn (runtime_api/multiprocessing.py):
- API signature parity with real PyTorch spawn; execution is
cooperative greenlet scheduler (process isolation etc. are explicit
non-goals per D1.0).
- _drain_pending drains worker-waits then collectives in one barrier,
loop-until-empty.
- Round-based exception handling with SystemExit sibling abort +
SpawnException(errors) wrapping root-cause ranks.
- RuntimeContext attaches ctx.multiprocessing in __post_init__.
- benches/ccl_allreduce.py hand-rolled loop collapses to one
torch.multiprocessing.spawn call.
D2–D6 — kernbench.tp package:
- parallel_state: initialize_model_parallel, get_*_rank,
get_*_world_size, with weak active-ctx registry in context.py.
- layers: ColumnParallelLinear, RowParallelLinear (shape-only
primitives — fp16 gemm via tl.load + tl.dot + tl.store).
- kernels: _gemm_kernel used by TP layers (self-contained; no bench
dependency).
- primitives / mappings stubs per D6/D8.
Data-path fixes (surfaced by TP gemm + all_reduce sequence):
- sim_engine/op_log.py: dma_write snapshot is skipped for TCM
sources (PE scratch is repopulated by Phase 2 math/gemm replay —
capturing Phase-1-time snapshot picked up STALE data from prior
kernel's output aliased at the same scratch addr, causing the later
kernel's dma_write to overwrite Phase 2 result with stale value).
- sim_engine/op_log.py + sim_engine/data_executor.py: per-operand
space recorded on GemmCmd and composite gemm records so HBM-resident
operands (tl.load output) don't default to TCM during replay.
- runtime_api/context.py: ctx.zeros writes zero-init to MemoryStore
at VA keys so kernels reading via VA see deterministic init even
without explicit copy_().
Tests (Phase 1 + Phase 2):
- test_worker_wait_drain (T3): orphan invariant + resume + multi-rank
drain + idempotency + exception propagation.
- test_mp_spawn (T4): spawn shape + bind + SpawnException scope.
- test_host_read_barrier (T5): barrier contract per entry-point +
closed-set registry check.
- test_tp_parallel_state (T1): initialize + rank lookup.
- test_tp_layers (T2): shape + deterministic numerical correctness
(concat-matmul equality for RowParallel, not mean-only).
- test_tp_mlp (T6): full 2-layer MLP with deterministic weight
numerical match + rank-consistency post all-reduce.
- test_ccl_allreduce_matrix: ring_default_ws xfail removed (T7).
Regression: 523 pre + 35 new + 1 ex-xfail = 559 passed, 1 intentional
skip (T3.e historical failure documentation).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: In ring all-reduce, PE_IPCQ's recv handler advances my_tail
and issues a credit return immediately. With tight credit latency
(0.12ns intra-cube), the sender can refill the slot BEFORE the
receiver's outbound PE_DMA reads from it for the next send. The
outbound snapshot then captures stale data from a later round.
Fix: Propagate TensorHandle.data (captured at recv-time, before credit
return) through the entire send chain:
tl.send(src=handle) → IpcqSendCmd.data → IpcqDmaToken.data
PE_DMA outbound already prefers token.data over MemoryStore read, so
the recv-time snapshot is used for the in-flight data. This eliminates
the race: the snapshot is captured before the slot can be overwritten.
Additional fixes:
- PE_MATH handle_command: compute SIMD latency from output tensor
element count via _compute_ns(), using max(overhead_ns, compute_ns).
Previously used overhead_ns=0.0 for all standalone MathCmd, making
math ops take 0ns in SimPy.
- DataExecutor secondary sort: same-t_start ops sorted by op_kind
(memory < gemm < math) so IPCQ slot writes execute before math reads.
- ipcq_copy recorded at INBOUND time (receiver PE_DMA arrival) instead
of outbound. Inbound time is after fabric propagation, so it sorts
correctly relative to the receiver's math.
- record_copy accepts explicit snapshot parameter (from token.data).
Result: N_ELEM=32 + 256-rank + n_slots=4 + cross-SIP now passes.
n_slots reverted to 4 (the deeper buffer was a workaround, not needed).
502 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test matrix restructure:
- 256-rank full-system ring runs only ONCE (marked pytest.mark.slow)
instead of 7× across matrix + perf tests. Cross-SIP routing is
verified by the single run; buffer variants (tcm/hbm/sram) are
tested at 8-rank where they finish in <0.5s.
- Performance tests use 8-rank instead of 256-rank.
- `pytest -m "not slow"` completes in ~2.5min (local dev).
- Full suite including slow: ~6min (CI).
DataExecutor optimization:
- Remove ThreadPoolExecutor from DataExecutor.run(). Same-t_start
groups are almost always size 1, so the thread pool creation and
dispatch overhead dominated. Simple sequential loop is faster.
- Skip dma_read ops at the loop level (they are always no-ops in
Phase 2 but were dispatched through _execute_op → _execute_memory).
- Remove redundant CLI Phase 2 re-execution: engine._flush_data_phase
already replays during engine.wait(); the CLI now only prints the
diagnostic summary without re-running DataExecutor.
502 tests pass. Wall time: 25m30s → 5m43s (full), 2m28s (no slow).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- CLI: --verify-data flag enables Phase 2 data verification (ADR-0020)
- Tensor.data: returns actual numpy values (verify-data) or zeros placeholder
- Tensor.__repr__: shows value summary or data=N/A (placeholder)
- DataExecutor: ThreadPoolExecutor for same-timestamp parallel op execution
- BenchResult.engine: exposes op_log/memory_store for Phase 2 access
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove xbar_top/bot, bridge, single noc node from topology
- Each cube_mesh.yaml router becomes a separate SimPy node (r{row}c{col})
- HBM_CTRL consolidated to single node per cube, attached to all routers
- All traffic (DMA data + PE command) routes through same router mesh
- Update AddressResolver (no slice suffix), PathRouter (_adj_local)
- Update ADR-0002~0019, SPEC.md to remove xbar/bridge references
- Regenerate SVG diagrams for new topology structure
- Skip cross-SIP PE_TCM and PE_MMU routing tests (not yet wired)
326 passed, 13 skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>