Phase 2c-2/3: per-flit wire timing + flit-aware routers + HBM CTRL

Root cause of Phase 2c-1 timing collapse identified: src.out_port and
dst.in_port aliased the same simpy.Store, so when the wire chunkified a
Transaction into Flits and re-put them, fan_in could pull flits before
the wire applied BW delay. Half the flits bypassed bottleneck timing.
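
A toy illustration of the aliasing, as a sketch only: collections.deque
stands in for simpy.Store, and build_ports is a hypothetical helper, not
engine code. With one shared object, anything placed on the out port is
immediately visible on the in port, before any wire process has run.

```python
from collections import deque


def build_ports(aliased: bool):
    """Return (out_port, in_port) for one directed edge.

    Hypothetical helper: deque is only a stand-in for simpy.Store.
    """
    out_port = deque()
    # Aliased wiring reuses the same object; the fix allocates a second one.
    in_port = out_port if aliased else deque()
    return out_port, in_port


# Old wiring: both names refer to one queue.
old_out, old_in = build_ports(aliased=True)
old_out.append("flit0")
leaked_early = len(old_in) == 1  # consumer can see the flit with no wire delay

# New wiring: the flit stays on the out port until a wire moves it.
new_out, new_in = build_ports(aliased=False)
new_out.append("flit0")
held_for_wire = len(new_in) == 0
```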

Fix: separate Stores per directed edge. Wire is the only conduit. Each
flit on the wire incurs chunk_time = flit_nbytes/bw_gbs once, in arrival
order. Multi-hop wormhole pipelining emerges naturally because
flit-aware pass-through (TransitComponent) forwards each flit serially
without reassembly.
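
A minimal sketch of why pipelining emerges, assuming an idealized model
with zero propagation delay and identical links on every hop. The
function names and the 64-byte flit size used in the example are
assumptions for illustration; the engine's actual flit size is not
stated here.

```python
def wormhole_latency_ns(nbytes, flit_bytes, bw_bytes_per_ns, hops):
    """Time at which the last flit leaves the last hop.

    Each hop serializes flits one at a time and forwards every flit as
    soon as it arrives, without waiting for the whole transaction.
    """
    if nbytes <= 0:
        return 0.0
    n_flits = -(-nbytes // flit_bytes)  # ceiling division
    last = nbytes - flit_bytes * (n_flits - 1)  # final flit may be short
    sizes = [flit_bytes] * (n_flits - 1) + [last]
    ready = [0.0] * n_flits  # when each flit is available at the hop input
    for _ in range(hops):
        link_free = 0.0
        out = []
        for t, size in zip(ready, sizes):
            start = max(t, link_free)  # wait for the flit and for the link
            link_free = start + size / bw_bytes_per_ns
            out.append(link_free)
        ready = out
    return ready[-1]


def store_and_forward_latency_ns(nbytes, bw_bytes_per_ns, hops):
    """Baseline: every hop reassembles the full payload before forwarding."""
    return hops * nbytes / bw_bytes_per_ns


# 64 KB over 3 hops at 128 bytes/ns (128 GB/s), hypothetical 64 B flits.
cut_through = wormhole_latency_ns(64 * 1024, 64, 128.0, hops=3)
reassembled = store_and_forward_latency_ns(64 * 1024, 128.0, hops=3)
```

In this model each extra hop adds only one flit time to the cut-through
latency, while the reassembling baseline pays the full drain per hop.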

64 KB MemoryWrite via UCIe 128 GB/s bottleneck: 273 ns (broken) → 545 ns
(matches drain 512 + commit 8 + path overheads). 1 MB: 8230 ns (matches
drain 8192). A single-flit transfer costs transport time alone, exactly
what real-HW wormhole routing produces.
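
The drain figures can be checked with a few lines of arithmetic. This is
a sketch under two assumptions: 1 GB/s is treated as 1 byte/ns (decimal
GB), and KB/MB are binary (1024-based), which is what makes the quoted
512 and 8192 come out exactly. drain_ns is a hypothetical helper name.

```python
def drain_ns(nbytes: int, bw_gbs: float) -> float:
    """Serialization time in ns, treating 1 GB/s as 1 byte/ns."""
    return nbytes / bw_gbs


drain_64k = drain_ns(64 * 1024, 128)    # 512.0
drain_1m = drain_ns(1024 * 1024, 128)   # 8192.0

# Remainder of the measured 545 ns after drain and the 8 ns commit,
# i.e. what the message attributes to path overheads.
path_overhead_64k = 545 - drain_64k - 8  # 25.0
```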

3 pre-existing tests now off by small margins or inverted:
- test_h2d_local_cube_cut_through: 65.53 vs threshold 65.0
- test_engine_override_is_scoped_to_impl: ZeroRouter inherits
  ComponentBase, not flit-aware, so override path reassembles at each
  hop while default doesn't
- test_intra_sip_critical_path_at_96k_below_threshold: 96KB allreduce
  microscopically over its threshold

Not weakening these to pass: they reflect model fidelity improvements
that need calibrated thresholds. To address in follow-up via test
threshold updates and ZeroRouter→TransitComponent inheritance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 22:43:40 -07:00
parent b31b3e8248
commit 4929040cf1
3 changed files with 138 additions and 25 deletions
+32 -18
@@ -85,15 +85,22 @@ class GraphEngine:
             for node_id, node in graph.nodes.items()
         }

-        # Wire ports: one Store per directed edge (ADR-0015 D1)
+        # Wire ports: SEPARATE Stores for src.out_port and dst.in_port per
+        # directed edge (ADR-0015 D1, ADR-0033 Phase 2c). The wire process
+        # is the only conduit between them: pulls from src.out_port,
+        # processes per-flit timing, puts on dst.in_port. Using separate
+        # stores eliminates a race with `fan_in` that would otherwise let
+        # flits bypass wire's BW occupancy (fan_in could pull a flit from
+        # the same store before wire put it back delayed).
         for e in graph.edges:
             src_comp = self._components.get(e.src)
             dst_comp = self._components.get(e.dst)
             if src_comp is None or dst_comp is None:
                 continue
-            store: simpy.Store = simpy.Store(self._env)
-            src_comp.out_ports[e.dst] = store
-            dst_comp.in_ports[e.src] = store
+            out_store: simpy.Store = simpy.Store(self._env)
+            in_store: simpy.Store = simpy.Store(self._env)
+            src_comp.out_ports[e.dst] = out_store
+            dst_comp.in_ports[e.src] = in_store

         # Wire processes: propagation delay + BW occupancy per edge (ADR-0015 D2)
         # Cut-through (wormhole) model: wires apply propagation delay per hop.
@@ -267,25 +274,32 @@ class GraphEngine:
         available_at = 0.0
         while True:
             msg = yield out_port.get()
-            # ADR-0033 Phase 2c-1: chunkify Transactions into Flits but
-            # emit atomically (same env.now) to preserve current timing.
-            # Phase 2c-2 will graduate to per-flit timing.
+            # ADR-0033 Phase 2c-2/3: per-flit transport timing.
+            # Transactions with payload chunkify into Flits; each flit
+            # occupies the wire for ``flit_nbytes/bw_gbs`` and is
+            # delivered after ``prop_ns + transfer_time``. Wormhole
+            # pipelining emerges naturally because downstream flit-aware
+            # components forward flits without reassembly.
             if isinstance(msg, Transaction) and msg.nbytes > 0:
                 items = list(msg.into_flits(self._flit_bytes))
-                payload_nbytes = msg.nbytes
             else:
                 items = [msg]
-                payload_nbytes = getattr(msg, "nbytes", 0) or 0
-            # BW occupancy: wait for link to become free, then mark busy
-            if bw_gbs > 0 and payload_nbytes > 0:
-                wait = available_at - self._env.now
-                if wait > 0:
-                    yield self._env.timeout(wait)
-                available_at = self._env.now + (payload_nbytes / bw_gbs)
-            # Propagation delay
-            if prop_ns > 0:
-                yield self._env.timeout(prop_ns)
             for item in items:
+                if isinstance(item, Flit):
+                    item_nbytes = item.flit_nbytes
+                elif isinstance(item, Transaction):
+                    item_nbytes = item.nbytes
+                else:
+                    item_nbytes = getattr(item, "nbytes", 0) or 0
+                if bw_gbs > 0 and item_nbytes > 0:
+                    wait = available_at - self._env.now
+                    if wait > 0:
+                        yield self._env.timeout(wait)
+                    available_at = self._env.now + item_nbytes / bw_gbs
+                    yield self._env.timeout(prop_ns + item_nbytes / bw_gbs)
+                else:
+                    if prop_ns > 0:
+                        yield self._env.timeout(prop_ns)
                 yield in_port.put(item)

     def _process(self, key: str, request: Any, done: simpy.Event):