Phase 2c-2/3: per-flit wire timing + flit-aware routers + HBM CTRL

Root cause of the Phase 2c-1 timing collapse identified: src.out_port and dst.in_port aliased the same simpy.Store. When the wire chunkified a Transaction into Flits and re-put them, fan_in could pull flits before the wire applied the BW delay, so half the flits bypassed bottleneck timing.

Fix: separate Stores per directed edge. The wire is the only conduit between them. Each flit on the wire incurs chunk_time = flit_nbytes/bw_gbs once, in arrival order. Multi-hop wormhole pipelining emerges naturally because the flit-aware pass-through (TransitComponent) forwards each flit serially without reassembly.

Results:
- 64 KB MemoryWrite via the UCIe 128 GB/s bottleneck: 273 ns (broken) → 545 ns (matches drain 512 + commit 8 + path overheads).
- 1 MB: 8230 ns (matches drain 8192).
- A single-flit transfer costs transport time alone, exactly what real-HW wormhole routing produces.

3 pre-existing tests are now off by small margins or inverted:
- test_h2d_local_cube_cut_through: 65.53 vs threshold 65.0
- test_engine_override_is_scoped_to_impl: ZeroRouter inherits ComponentBase, which is not flit-aware, so the override path reassembles at each hop while the default path doesn't
- test_intra_sip_critical_path_at_96k_below_threshold: the 96KB allreduce lands microscopically over its threshold

Not weakening these tests to make them pass: they reflect model fidelity improvements that need calibrated thresholds. To be addressed in a follow-up via test threshold updates and ZeroRouter→TransitComponent inheritance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
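The drain figures quoted above can be sanity-checked with a small standalone sketch. It is not the engine code: `flit_sizes`, `multi_hop_ns`, and the 256-byte flit size are hypothetical names and values chosen for illustration, propagation delay is ignored, every hop is assumed to run at the same bandwidth, and "64 KB" / "1 MB" are read as 65536 and 1048576 bytes. It relies on 1 GB/s being 1 byte/ns, so nbytes / bw_gbs is already in nanoseconds. The same model also shows why a hop that reassembles (as the message describes for ZeroRouter) costs more than flit-aware forwarding.

```python
from typing import List


def flit_sizes(nbytes: int, flit_bytes: int) -> List[int]:
    """Split a payload into flit sizes; the last flit may be short."""
    sizes = []
    remaining = nbytes
    while remaining > 0:
        size = min(flit_bytes, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


def multi_hop_ns(nbytes: int, bw_gbs: float, flit_bytes: int,
                 hops: int, reassemble: bool) -> float:
    """Time at which the last flit leaves the final wire.

    Each wire charges flit_nbytes / bw_gbs per flit, in arrival order.
    With reassemble=True a hop waits for the whole payload before
    forwarding any flit (store-and-forward); with reassemble=False each
    flit is forwarded as soon as it arrives (wormhole).
    """
    flits = flit_sizes(nbytes, flit_bytes)
    if not flits:
        return 0.0
    ready = [0.0] * len(flits)  # when each flit is available to the first wire
    for _ in range(hops):
        if reassemble:
            # Nothing is forwarded until the last flit has arrived.
            ready = [max(ready)] * len(flits)
        available_at = 0.0  # when this wire is next free
        delivered = []
        for t_ready, size in zip(ready, flits):
            start = max(t_ready, available_at)
            available_at = start + size / bw_gbs  # 1 GB/s == 1 byte/ns
            delivered.append(available_at)
        ready = delivered
    return ready[-1]


FLIT_BYTES = 256  # hypothetical flit size, not taken from the commit

print(multi_hop_ns(64 * 1024, 128.0, FLIT_BYTES, hops=1, reassemble=False))    # 512.0
print(multi_hop_ns(1024 * 1024, 128.0, FLIT_BYTES, hops=1, reassemble=False))  # 8192.0
print(multi_hop_ns(64 * 1024, 128.0, FLIT_BYTES, hops=3, reassemble=False))    # 516.0
print(multi_hop_ns(64 * 1024, 128.0, FLIT_BYTES, hops=3, reassemble=True))     # 1536.0
```

Under these assumptions the single-wire drain matches the 512 ns and 8192 ns figures, and each extra wormhole hop adds only one flit time (2 ns here), whereas each reassembling hop adds a full payload drain.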
2026-05-14 22:43:40 -07:00
parent b31b3e8248
commit 4929040cf1
3 changed files with 138 additions and 25 deletions
@@ -85,15 +85,22 @@ class GraphEngine:
            for node_id, node in graph.nodes.items()
        }

-        # Wire ports: one Store per directed edge (ADR-0015 D1)
+        # Wire ports: SEPARATE Stores for src.out_port and dst.in_port per
+        # directed edge (ADR-0015 D1, ADR-0033 Phase 2c). The wire process
+        # is the only conduit between them: pulls from src.out_port,
+        # processes per-flit timing, puts on dst.in_port. Using separate
+        # stores eliminates a race with `fan_in` that would otherwise let
+        # flits bypass wire's BW occupancy (fan_in could pull a flit from
+        # the same store before wire put it back delayed).
        for e in graph.edges:
            src_comp = self._components.get(e.src)
            dst_comp = self._components.get(e.dst)
            if src_comp is None or dst_comp is None:
                continue
-            store: simpy.Store = simpy.Store(self._env)
-            src_comp.out_ports[e.dst] = store
-            dst_comp.in_ports[e.src] = store
+            out_store: simpy.Store = simpy.Store(self._env)
+            in_store: simpy.Store = simpy.Store(self._env)
+            src_comp.out_ports[e.dst] = out_store
+            dst_comp.in_ports[e.src] = in_store

        # Wire processes: propagation delay + BW occupancy per edge (ADR-0015 D2)
        # Cut-through (wormhole) model: wires apply propagation delay per hop.
@@ -267,25 +274,32 @@ class GraphEngine:
        available_at = 0.0
        while True:
            msg = yield out_port.get()
-            # ADR-0033 Phase 2c-1: chunkify Transactions into Flits but
-            # emit atomically (same env.now) to preserve current timing.
-            # Phase 2c-2 will graduate to per-flit timing.
+            # ADR-0033 Phase 2c-2/3: per-flit transport timing.
+            # Transactions with payload chunkify into Flits; each flit
+            # occupies the wire for ``flit_nbytes/bw_gbs`` and is
+            # delivered after ``prop_ns + transfer_time``. Wormhole
+            # pipelining emerges naturally because downstream flit-aware
+            # components forward flits without reassembly.
            if isinstance(msg, Transaction) and msg.nbytes > 0:
                items = list(msg.into_flits(self._flit_bytes))
-                payload_nbytes = msg.nbytes
            else:
                items = [msg]
-                payload_nbytes = getattr(msg, "nbytes", 0) or 0
-            # BW occupancy: wait for link to become free, then mark busy
-            if bw_gbs > 0 and payload_nbytes > 0:
-                wait = available_at - self._env.now
-                if wait > 0:
-                    yield self._env.timeout(wait)
-                available_at = self._env.now + (payload_nbytes / bw_gbs)
-            # Propagation delay
-            if prop_ns > 0:
-                yield self._env.timeout(prop_ns)
            for item in items:
+                if isinstance(item, Flit):
+                    item_nbytes = item.flit_nbytes
+                elif isinstance(item, Transaction):
+                    item_nbytes = item.nbytes
+                else:
+                    item_nbytes = getattr(item, "nbytes", 0) or 0
+                if bw_gbs > 0 and item_nbytes > 0:
+                    wait = available_at - self._env.now
+                    if wait > 0:
+                        yield self._env.timeout(wait)
+                    available_at = self._env.now + item_nbytes / bw_gbs
+                    yield self._env.timeout(prop_ns + item_nbytes / bw_gbs)
+                else:
+                    if prop_ns > 0:
+                        yield self._env.timeout(prop_ns)
                yield in_port.put(item)

    def _process(self, key: str, request: Any, done: simpy.Event):