Honest measured pipeline efficiency: two timing fixes

Two related issues made measured pipeline efficiency look
worse than the behavior the simulator actually models:

1. DMA timing recorded too early. The op-log start timestamp
   for a DMA op fired when the request entered the queue, and
   the DMA channel was released as soon as the request was
   issued. Back-to-back DMAs therefore appeared to grab the
   channel simultaneously, with per-op duration drifting
   upward as queue depth grew - an artifact, not real cost.

   Fix: defer the start timestamp until after the channel is
   acquired, and hold the channel through the full HBM
   round-trip until the response returns. Per-op duration is
   now constant and equal to the actual transfer interval;
   serialization is visible as queue wait, not as inflated
   service time.

2. Sweep timing window folded in pre-composite work. The PE
   timing window spanned every PE engine record, which
   included the upfront pinned-operand DMA issued before the
   composite GEMM begins. For large-K shapes that one-shot
   load can be nearly half of the window, conflating
   operand-staging cost with composite-pipeline behavior.

   Fix: add a second window scoped to the composite pipeline
   by filtering op_log records to those tagged with a
   tile-pipeline stage; the legacy operand-load path is
   untagged and naturally excluded. For 32x3072x32 load_ref
   the window drops from 1765 ns to 992 ns, and measured
   efficiency lines up with the steady-state DMA-bound stage
   limit instead of being penalized for the one-time load.
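
The timestamp change in fix 1 can be illustrated with a toy model.
This is a minimal sketch, not the simulator's code: OpRecord,
simulate, durations and queue_waits are hypothetical names, and the
model assumes every op enqueues at t=0 and that transfers serialize
on a single channel. It does not reproduce the old early channel
release; it only contrasts the two timestamp conventions.

```python
from dataclasses import dataclass


@dataclass
class OpRecord:
    t_enqueue: float  # when the request entered the queue
    t_start: float    # timestamp written to the op log
    t_end: float      # when the response returned


def simulate(n_ops: int, service_ns: float, defer_start: bool) -> list[OpRecord]:
    """Toy single-channel DMA: all ops enqueue at t=0, one transfer in flight."""
    records = []
    channel_free_at = 0.0
    for _ in range(n_ops):
        t_enqueue = 0.0
        t_acquire = max(t_enqueue, channel_free_at)
        t_end = t_acquire + service_ns
        channel_free_at = t_end  # channel held through the full round trip
        # Assumption: defer_start=True mimics recording after acquisition,
        # defer_start=False mimics recording at queue entry.
        t_start = t_acquire if defer_start else t_enqueue
        records.append(OpRecord(t_enqueue, t_start, t_end))
    return records


def durations(records: list[OpRecord]) -> list[float]:
    return [r.t_end - r.t_start for r in records]


def queue_waits(records: list[OpRecord]) -> list[float]:
    return [r.t_start - r.t_enqueue for r in records]


early = simulate(4, 100.0, defer_start=False)
deferred = simulate(4, 100.0, defer_start=True)
print(durations(early))       # [100.0, 200.0, 300.0, 400.0]
print(durations(deferred))    # [100.0, 100.0, 100.0, 100.0]
print(queue_waits(deferred))  # [0.0, 100.0, 200.0, 300.0]
```

With the early timestamp, the recorded duration grows with queue
position; with the deferred timestamp, the duration stays at the
service time and the serialization shows up as queue wait.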

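The window scoping in fix 2 can be sketched in the same spirit. This
is an illustration under stated assumptions: the Rec type, its stage
field, and the helper names are hypothetical, and the timestamps are
made-up toy values rather than measurements from the sweep.

```python
from typing import Iterable, NamedTuple, Optional


class Rec(NamedTuple):
    op: str
    stage: Optional[str]  # hypothetical tile-pipeline stage tag; None = untagged
    t_start: float
    t_end: float


def window_ns(records: Iterable[Rec]) -> float:
    """Span from the earliest start to the latest end; 0.0 if there are no records."""
    recs = list(records)
    if not recs:
        return 0.0
    return max(r.t_end for r in recs) - min(r.t_start for r in recs)


def composite_window_ns(records: Iterable[Rec]) -> float:
    """Same span, restricted to records that carry a stage tag."""
    return window_ns(r for r in records if r.stage is not None)


# Toy op log: one untagged upfront operand load, then tagged pipeline stages.
op_log = [
    Rec("pinned_operand_load", None, 0.0, 400.0),
    Rec("dma_read", "load", 400.0, 500.0),
    Rec("gemm", "compute", 500.0, 650.0),
    Rec("dma_write", "store", 650.0, 700.0),
]

print(window_ns(op_log))            # 700.0
print(composite_window_ns(op_log))  # 300.0
```

Filtering on the tag rather than on op names means the untagged
operand load drops out without a special case.
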
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 14:19:17 -07:00
parent 83ea97b05f
commit f6d262e359
7 changed files with 543 additions and 263 deletions
+31 -10
@@ -27,6 +27,12 @@ class PeDmaComponent(PeEngineBase):
     (DmaReadCmd → HBM read, DmaWriteCmd → HBM write)
     """
 
+    # Defer op_log record_start until AFTER the DMA channel is acquired so
+    # t_start reflects the serve-start moment (post queueing) rather than
+    # the queue-enter moment. ComponentBase._handle_with_hooks consults this
+    # flag.
+    _DEFER_RECORD_START = True
+
     def __init__(self, node: Node, ctx: ComponentContext | None = None) -> None:
         super().__init__(node, ctx)
         self._dma_read: simpy.Resource | None = None
@@ -80,9 +86,16 @@ class PeDmaComponent(PeEngineBase):
         path = self.ctx.router.find_path(self._pe_prefix, dst_node)
         drain_ns = self.ctx.compute_drain_ns(path, cmd.nbytes)
 
-        # Acquire DMA channel (command issue serialization)
+        # Acquire DMA channel — held through the entire round-trip so the
+        # channel models "one DMA in flight per PE per direction" rather
+        # than just issue-time serialization. This is what makes Option B
+        # meaningful: t_start = serve-start covers the actual transfer.
         with dma_res.request() as req:
             yield req
+            # Option B: record_start fires AFTER channel acquired, so t_start
+            # = serve-start (excludes queue wait). _DEFER_RECORD_START=True
+            # suppresses the auto-start in ComponentBase._handle_with_hooks.
+            self._on_process_start(env, cmd)
             # Create sub-Transaction with PeDmaMsg (HbmCtrl handles it directly)
             sub_done = env.event()
             sub_request = PeDmaMsg(
@@ -99,10 +112,8 @@ class PeDmaComponent(PeEngineBase):
             # Send to next hop (path[0] is pe_dma itself, path[1] is router)
             if len(path) > 1:
                 yield self.out_ports[path[1]].put(sub_txn.advance())
-        # DMA channel released after issue
-
-        # Wait for HBM transfer completion
-        yield sub_done
+            # Wait for HBM transfer completion BEFORE releasing the channel.
+            yield sub_done
         pe_txn.done.succeed()
 
     def _worker(self, env: simpy.Environment) -> Generator:
@@ -293,15 +304,17 @@ class PeDmaComponent(PeEngineBase):
         txn.done.succeed()
 
     def _pipeline_process(self, env: simpy.Environment, token: Any) -> Generator:
-        """Pipeline mode: DMA read/write via fabric, then self-route."""
-        self._on_process_start(env, token)
+        """Pipeline mode: DMA read/write via fabric, then self-route.
+
+        Option B: record_start is fired *inside* _do_pipeline_dma, after the
+        DMA channel is acquired — record_end stays here.
+        """
         yield from self._do_pipeline_dma(env, token)
         self._on_process_end(env, token)
 
         # Self-routing (handle same-component consecutive stages)
         next_stage = token.advance()
         while next_stage is not None and next_stage.component == self.node.id:
-            self._on_process_start(env, token)
             yield from self._do_pipeline_dma(env, token)
             self._on_process_end(env, token)
             next_stage = token.advance()
@@ -340,8 +353,13 @@ class PeDmaComponent(PeEngineBase):
             path = self.ctx.router.find_path(self._pe_prefix, dst_node)
             drain_ns = self.ctx.compute_drain_ns(path, nbytes)
 
+            # Hold dma_res through the full round-trip — one DMA in flight
+            # per PE per direction — so Option B's t_start (post-acquire)
+            # bounds the actual transfer interval.
             with dma_res.request() as req:
                 yield req
+                # Option B: t_start = post-acquire moment.
+                self._on_process_start(env, token)
                 sub_done = env.event()
                 sub_request = PeDmaMsg(
                     correlation_id="pipeline",
@@ -356,8 +374,11 @@ class PeDmaComponent(PeEngineBase):
                 )
                 if len(path) > 1:
                     yield self.out_ports[path[1]].put(sub_txn.advance())
-
-            yield sub_done
+                yield sub_done
+        else:
+            # No-op (nbytes==0 or no ctx): no channel wait, but still record
+            # so _on_process_end has a matching pending entry to finalise.
+            self._on_process_start(env, token)
 
     def _forward_txn(self, env: simpy.Environment, txn: Any) -> Generator:
         """Handle external Transaction (PeDmaMsg probe, M_CPU DMA) with channel acquisition."""