Phase 2c-2/3: per-flit wire timing + flit-aware routers + HBM CTRL

Root cause of the Phase 2c-1 timing collapse: src.out_port and dst.in_port
aliased the same simpy.Store. When the wire chunkified a Transaction into
Flits and re-put them, fan_in could pull flits before the wire applied the
BW delay, so half the flits bypassed bottleneck timing.

Fix: separate Stores per directed edge. The wire is the only conduit. Each
flit on the wire incurs chunk_time = flit_nbytes/bw_gbs once, in arrival
order. Multi-hop wormhole pipelining emerges naturally because flit-aware
pass-through (TransitComponent) forwards each flit serially without
reassembly.

Results:
- 64 KB MemoryWrite via the UCIe 128 GB/s bottleneck: 273 ns (broken) → 545 ns
  (matches drain 512 + commit 8 + path overheads).
- 1 MB: 8230 ns (matches drain 8192).
- A single-flit transfer costs transport time alone, exactly what real-HW
  wormhole produces.

3 pre-existing tests are now off by small margins or inverted:
- test_h2d_local_cube_cut_through: 65.53 vs threshold 65.0
- test_engine_override_is_scoped_to_impl: ZeroRouter inherits ComponentBase,
  which is not flit-aware, so the override path reassembles at each hop while
  the default path doesn't
- test_intra_sip_critical_path_at_96k_below_threshold: 96KB allreduce
  microscopically over its threshold

Not weakening these to pass: they reflect model fidelity improvements that
need calibrated thresholds. To be addressed in a follow-up via test threshold
updates and ZeroRouter→TransitComponent inheritance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
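
The drain figures in the message (512 ns for 64 KB, 8192 ns for 1 MB at 128 GB/s) and the wormhole-versus-reassembly gap behind the ZeroRouter test can be checked with a small stand-alone calculation. This is only a sketch, not the engine's code: the function names are hypothetical, the 64-byte flit size is an assumed value, every hop is assumed to have the same bandwidth, and propagation delay, commit time and path overheads are left out.

```python
def flit_sizes(nbytes, flit_bytes):
    """Split a payload into full flits plus one short tail flit, if any."""
    full, rem = divmod(nbytes, flit_bytes)
    return [flit_bytes] * full + ([rem] if rem else [])


def wormhole_ns(nbytes, flit_bytes, bw_gbs, hops):
    """Arrival time of the last flit when every hop forwards flit by flit.

    Assumes identical bandwidth on each hop and zero propagation delay.
    """
    sizes = flit_sizes(nbytes, flit_bytes)
    if not sizes:
        return 0.0
    arrivals = [0.0] * len(sizes)  # all flits are ready at the source at t=0
    for _ in range(hops):
        link_free = 0.0
        delivered = []
        for ready, size in zip(arrivals, sizes):
            # A flit starts once it has arrived AND the link is idle.
            start = max(ready, link_free)
            link_free = start + size / bw_gbs  # 1 GB/s == 1 byte/ns
            delivered.append(link_free)
        arrivals = delivered
    return arrivals[-1]


def store_and_forward_ns(nbytes, bw_gbs, hops):
    """Each hop waits for the whole payload before sending it onward."""
    return hops * (nbytes / bw_gbs)


if __name__ == "__main__":
    KB = 1024
    print(wormhole_ns(64 * KB, 64, 128.0, hops=1))        # 512.0
    print(wormhole_ns(1024 * KB, 64, 128.0, hops=1))      # 8192.0
    print(wormhole_ns(64 * KB, 64, 128.0, hops=3))        # 513.0
    print(store_and_forward_ns(64 * KB, 128.0, hops=3))   # 1536.0
```

Under these assumptions a three-hop path costs one extra flit time per additional hop when flits are forwarded as they arrive, but a full payload drain per hop when each hop reassembles. That is the direction of the difference the message describes for the non-flit-aware override path.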
@@ -85,15 +85,22 @@ class GraphEngine:
             for node_id, node in graph.nodes.items()
         }
 
-        # Wire ports: one Store per directed edge (ADR-0015 D1)
+        # Wire ports: SEPARATE Stores for src.out_port and dst.in_port per
+        # directed edge (ADR-0015 D1, ADR-0033 Phase 2c). The wire process
+        # is the only conduit between them: pulls from src.out_port,
+        # processes per-flit timing, puts on dst.in_port. Using separate
+        # stores eliminates a race with `fan_in` that would otherwise let
+        # flits bypass wire's BW occupancy (fan_in could pull a flit from
+        # the same store before wire put it back delayed).
         for e in graph.edges:
             src_comp = self._components.get(e.src)
             dst_comp = self._components.get(e.dst)
             if src_comp is None or dst_comp is None:
                 continue
-            store: simpy.Store = simpy.Store(self._env)
-            src_comp.out_ports[e.dst] = store
-            dst_comp.in_ports[e.src] = store
+            out_store: simpy.Store = simpy.Store(self._env)
+            in_store: simpy.Store = simpy.Store(self._env)
+            src_comp.out_ports[e.dst] = out_store
+            dst_comp.in_ports[e.src] = in_store
 
         # Wire processes: propagation delay + BW occupancy per edge (ADR-0015 D2)
         # Cut-through (wormhole) model: wires apply propagation delay per hop.
@@ -267,25 +274,32 @@ class GraphEngine:
         available_at = 0.0
         while True:
             msg = yield out_port.get()
-            # ADR-0033 Phase 2c-1: chunkify Transactions into Flits but
-            # emit atomically (same env.now) to preserve current timing.
-            # Phase 2c-2 will graduate to per-flit timing.
+            # ADR-0033 Phase 2c-2/3: per-flit transport timing.
+            # Transactions with payload chunkify into Flits; each flit
+            # occupies the wire for ``flit_nbytes/bw_gbs`` and is
+            # delivered after ``prop_ns + transfer_time``. Wormhole
+            # pipelining emerges naturally because downstream flit-aware
+            # components forward flits without reassembly.
             if isinstance(msg, Transaction) and msg.nbytes > 0:
                 items = list(msg.into_flits(self._flit_bytes))
-                payload_nbytes = msg.nbytes
             else:
                 items = [msg]
-                payload_nbytes = getattr(msg, "nbytes", 0) or 0
-            # BW occupancy: wait for link to become free, then mark busy
-            if bw_gbs > 0 and payload_nbytes > 0:
-                wait = available_at - self._env.now
-                if wait > 0:
-                    yield self._env.timeout(wait)
-                available_at = self._env.now + (payload_nbytes / bw_gbs)
-            # Propagation delay
-            if prop_ns > 0:
-                yield self._env.timeout(prop_ns)
             for item in items:
+                if isinstance(item, Flit):
+                    item_nbytes = item.flit_nbytes
+                elif isinstance(item, Transaction):
+                    item_nbytes = item.nbytes
+                else:
+                    item_nbytes = getattr(item, "nbytes", 0) or 0
+                if bw_gbs > 0 and item_nbytes > 0:
+                    wait = available_at - self._env.now
+                    if wait > 0:
+                        yield self._env.timeout(wait)
+                    available_at = self._env.now + item_nbytes / bw_gbs
+                    yield self._env.timeout(prop_ns + item_nbytes / bw_gbs)
+                else:
+                    if prop_ns > 0:
+                        yield self._env.timeout(prop_ns)
                 yield in_port.put(item)
 
     def _process(self, key: str, request: Any, done: simpy.Event):
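
The per-item loop added in the second hunk can be traced without SimPy by replacing each `yield self._env.timeout(...)` with plain clock arithmetic. The sketch below is a reading of that hunk under stated assumptions, not the engine itself: `wire_delivery_times` is a hypothetical name, all items are assumed to be queued at t=0, and `in_port.put` is assumed never to block (an unbounded Store).

```python
def wire_delivery_times(item_sizes, bw_gbs, prop_ns):
    """Return the time (ns) at which each item is put on the destination port.

    Mirrors the branch structure of the new wire loop, with SimPy timeouts
    replaced by additions to a local clock.
    """
    now = 0.0
    available_at = 0.0
    deliveries = []
    for size in item_sizes:
        if bw_gbs > 0 and size > 0:
            # Wait for the link to become free, then mark it busy.
            wait = available_at - now
            if wait > 0:
                now += wait
            transfer = size / bw_gbs  # 1 GB/s == 1 byte/ns
            available_at = now + transfer
            # Single timeout covering propagation plus transfer time.
            now += prop_ns + transfer
        elif prop_ns > 0:
            # Zero-byte items (or bw_gbs <= 0) pay propagation delay only.
            now += prop_ns
        deliveries.append(now)
    return deliveries


if __name__ == "__main__":
    flits = [64] * 4  # assumed 64-byte flits
    print(wire_delivery_times(flits, 128.0, prop_ns=0.0))  # [0.5, 1.0, 1.5, 2.0]
    print(wire_delivery_times(flits, 128.0, prop_ns=2.0))  # [2.5, 5.0, 7.5, 10.0]
```

With zero propagation delay the last delivery equals the payload drain time. With a non-zero `prop_ns`, this reading has each flit pay the propagation delay inside the same sequential loop, so the delay accumulates per flit rather than once per message. Whether that is intended is not stated in the commit.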