Honest measured pipeline efficiency: two timing fixes

Two related issues caused measured pipeline efficiency to look worse than the simulator's actual behavior:

1. DMA timing recorded too early. The op-log start timestamp for a DMA op fired when the request entered the queue, and the DMA channel was released as soon as the request was issued. Back-to-back DMAs therefore appeared to grab the channel simultaneously, with per-op duration drifting upward as queue depth grew - an artifact, not real cost.

   Fix: defer the start timestamp until after the channel is acquired, and hold the channel through the full HBM round-trip until the response returns. Per-op duration is now constant and equal to the actual transfer interval; serialization is visible as queue wait, not as inflated service time.

2. Sweep timing window folded in pre-composite work. The PE timing window spanned every PE engine record, which included the upfront pinned-operand DMA issued before the composite GEMM begins. For large-K shapes that one-shot load can be nearly half of the window, conflating operand-staging cost with composite-pipeline behavior.

   Fix: add a second window scoped to the composite pipeline by filtering op_log records to those tagged with a tile-pipeline stage; the legacy operand-load path is untagged and naturally excluded. For 32x3072x32 load_ref the window drops from 1765ns to 992ns and measured eff lines up with the steady-state DMA-bound stage limit instead of being penalized for the one-time load.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
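The two timing effects described in this message can be illustrated with a small standalone sketch. It does not use the simulator's SimPy code: `OpRecord`, `simulate_channel`, `window_ns`, the `stage` field, and every number below are hypothetical and chosen only for illustration. The sketch assumes a single FIFO channel that is held for the whole transfer, and compares durations measured from queue-enter against durations measured from channel acquisition, then compares a window over all records against one restricted to stage-tagged records.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OpRecord:
    """Hypothetical per-op timestamps in ns (not the simulator's op_log schema)."""
    t_enqueue: float              # request entered the DMA queue
    t_acquire: float              # channel acquired (serve-start)
    t_end: float                  # response returned, channel released
    stage: Optional[str] = None   # tile-pipeline stage tag, None if untagged


def simulate_channel(arrivals: List[float], service_ns: float,
                     stage: Optional[str] = None) -> List[OpRecord]:
    """One FIFO channel, held through the full round-trip of each op."""
    records = []
    channel_free_at = 0.0
    for t_enqueue in sorted(arrivals):
        t_acquire = max(t_enqueue, channel_free_at)  # wait for the channel
        t_end = t_acquire + service_ns               # fixed transfer time
        channel_free_at = t_end
        records.append(OpRecord(t_enqueue, t_acquire, t_end, stage))
    return records


def window_ns(records: List[OpRecord], tagged_only: bool = False) -> float:
    """Span from earliest start to latest end, optionally stage-tagged only."""
    if tagged_only:
        records = [r for r in records if r.stage is not None]
    return max(r.t_end for r in records) - min(r.t_acquire for r in records)


# Fix 1: four back-to-back DMAs enqueued at t=0, 100 ns each (assumed values).
ops = simulate_channel([0.0] * 4, service_ns=100.0)
early = [r.t_end - r.t_enqueue for r in ops]        # start at queue-enter
deferred = [r.t_end - r.t_acquire for r in ops]     # start after acquire
queue_wait = [r.t_acquire - r.t_enqueue for r in ops]

# Fix 2: one untagged upfront load, then three tagged pipeline ops.
upfront = OpRecord(t_enqueue=0.0, t_acquire=0.0, t_end=700.0, stage=None)
pipeline = simulate_channel([700.0] * 3, service_ns=100.0, stage="load")
full_window = window_ns([upfront] + pipeline)
scoped_window = window_ns([upfront] + pipeline, tagged_only=True)

print(early, deferred, queue_wait)
print(full_window, scoped_window)
```

With start recorded at queue-enter, the apparent duration grows with queue position; with start recorded after acquisition, it stays at the service time and the serialization shows up in `queue_wait` instead. Likewise, the scoped window excludes the untagged upfront load.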
2026-05-14 14:19:17 -07:00
parent 83ea97b05f
commit f6d262e359
7 changed files with 543 additions and 263 deletions
@@ -27,6 +27,12 @@ class PeDmaComponent(PeEngineBase):
        (DmaReadCmd → HBM read, DmaWriteCmd → HBM write)
    """

+    # Defer op_log record_start until AFTER the DMA channel is acquired so
+    # t_start reflects the serve-start moment (post queueing) rather than
+    # the queue-enter moment. ComponentBase._handle_with_hooks consults this
+    # flag.
+    _DEFER_RECORD_START = True
+
    def __init__(self, node: Node, ctx: ComponentContext | None = None) -> None:
        super().__init__(node, ctx)
        self._dma_read: simpy.Resource | None = None
@@ -80,9 +86,16 @@ class PeDmaComponent(PeEngineBase):
        path = self.ctx.router.find_path(self._pe_prefix, dst_node)
        drain_ns = self.ctx.compute_drain_ns(path, cmd.nbytes)

-        # Acquire DMA channel (command issue serialization)
+        # Acquire DMA channel — held through the entire round-trip so the
+        # channel models "one DMA in flight per PE per direction" rather
+        # than just issue-time serialization. This is what makes Option B
+        # meaningful: t_start = serve-start covers the actual transfer.
        with dma_res.request() as req:
            yield req
+            # Option B: record_start fires AFTER channel acquired, so t_start
+            # = serve-start (excludes queue wait). _DEFER_RECORD_START=True
+            # suppresses the auto-start in ComponentBase._handle_with_hooks.
+            self._on_process_start(env, cmd)
            # Create sub-Transaction with PeDmaMsg (HbmCtrl handles it directly)
            sub_done = env.event()
            sub_request = PeDmaMsg(
@@ -99,10 +112,8 @@ class PeDmaComponent(PeEngineBase):
            # Send to next hop (path[0] is pe_dma itself, path[1] is router)
            if len(path) > 1:
                yield self.out_ports[path[1]].put(sub_txn.advance())
-        # DMA channel released after issue
-
-        # Wait for HBM transfer completion
-        yield sub_done
+            # Wait for HBM transfer completion BEFORE releasing the channel.
+            yield sub_done
        pe_txn.done.succeed()

    def _worker(self, env: simpy.Environment) -> Generator:
@@ -293,15 +304,17 @@ class PeDmaComponent(PeEngineBase):
            txn.done.succeed()

    def _pipeline_process(self, env: simpy.Environment, token: Any) -> Generator:
-        """Pipeline mode: DMA read/write via fabric, then self-route."""
-        self._on_process_start(env, token)
+        """Pipeline mode: DMA read/write via fabric, then self-route.
+
+        Option B: record_start is fired *inside* _do_pipeline_dma, after the
+        DMA channel is acquired — record_end stays here.
+        """
        yield from self._do_pipeline_dma(env, token)
        self._on_process_end(env, token)

        # Self-routing (handle same-component consecutive stages)
        next_stage = token.advance()
        while next_stage is not None and next_stage.component == self.node.id:
-            self._on_process_start(env, token)
            yield from self._do_pipeline_dma(env, token)
            self._on_process_end(env, token)
            next_stage = token.advance()
@@ -340,8 +353,13 @@ class PeDmaComponent(PeEngineBase):
            path = self.ctx.router.find_path(self._pe_prefix, dst_node)
            drain_ns = self.ctx.compute_drain_ns(path, nbytes)

+            # Hold dma_res through the full round-trip — one DMA in flight
+            # per PE per direction — so Option B's t_start (post-acquire)
+            # bounds the actual transfer interval.
            with dma_res.request() as req:
                yield req
+                # Option B: t_start = post-acquire moment.
+                self._on_process_start(env, token)
                sub_done = env.event()
                sub_request = PeDmaMsg(
                    correlation_id="pipeline",
@@ -356,8 +374,11 @@ class PeDmaComponent(PeEngineBase):
                )
                if len(path) > 1:
                    yield self.out_ports[path[1]].put(sub_txn.advance())
-
-            yield sub_done
+                yield sub_done
+        else:
+            # No-op (nbytes==0 or no ctx): no channel wait, but still record
+            # so _on_process_end has a matching pending entry to finalise.
+            self._on_process_start(env, token)

    def _forward_txn(self, env: simpy.Environment, txn: Any) -> Generator:
        """Handle external Transaction (PeDmaMsg probe, M_CPU DMA) with channel acquisition."""