Honest measured pipeline efficiency: two timing fixes
Two related issues caused measured pipeline efficiency to look worse than the simulator's actual behavior:

1. DMA timing recorded too early. The op-log start timestamp for a DMA op fired when the request entered the queue, and the DMA channel was released as soon as the request was issued. Back-to-back DMAs therefore appeared to grab the channel simultaneously, with per-op duration drifting upward as queue depth grew - an artifact, not real cost.
   Fix: defer the start timestamp until after the channel is acquired, and hold the channel through the full HBM round-trip until the response returns. Per-op duration is now constant and equal to the actual transfer interval; serialization is visible as queue wait, not as inflated service time.

2. Sweep timing window folded in pre-composite work. The PE timing window spanned every PE engine record, which included the upfront pinned-operand DMA issued before the composite GEMM begins. For large-K shapes that one-shot load can be nearly half of the window, conflating operand-staging cost with composite-pipeline behavior.
   Fix: add a second window scoped to the composite pipeline by filtering op_log records to those tagged with a tile-pipeline stage; the legacy operand-load path is untagged and naturally excluded. For 32x3072x32 load_ref the window drops from 1765ns to 992ns and measured eff lines up with the steady-state DMA-bound stage limit instead of being penalized for the one-time load.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
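The composite-window filter from item 2 might look roughly like the sketch below. It is not the simulator's code: `OpRecord`, its `stage` field, and the helper names are hypothetical stand-ins for whatever the real op_log records carry, and the timestamps are made-up illustration values, not measurements from this commit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OpRecord:
    """Hypothetical op_log entry; the real record layout is not shown in the commit."""
    t_start: float
    t_end: float
    stage: Optional[str] = None  # None models the untagged legacy operand-load path


def window_ns(records: List[OpRecord]) -> float:
    """Span from the earliest start to the latest end, or 0.0 for no records."""
    if not records:
        return 0.0
    return max(r.t_end for r in records) - min(r.t_start for r in records)


def composite_window_ns(records: List[OpRecord]) -> float:
    """Same span, restricted to records tagged with a tile-pipeline stage."""
    return window_ns([r for r in records if r.stage is not None])


# Illustrative timestamps only (assumed, not taken from the simulator).
op_log = [
    OpRecord(0.0, 400.0),                    # upfront pinned-operand load, untagged
    OpRecord(400.0, 600.0, stage="dma_in"),
    OpRecord(500.0, 800.0, stage="gemm"),
    OpRecord(700.0, 900.0, stage="dma_out"),
]

full_window = window_ns(op_log)
composite_window = composite_window_ns(op_log)
print(full_window, composite_window)
```

Because the one-shot load carries no stage tag, it drops out of the second window without any special-casing, which is the property the commit message relies on.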
@@ -27,6 +27,12 @@ class PeDmaComponent(PeEngineBase):
     (DmaReadCmd → HBM read, DmaWriteCmd → HBM write)
     """
 
+    # Defer op_log record_start until AFTER the DMA channel is acquired so
+    # t_start reflects the serve-start moment (post queueing) rather than
+    # the queue-enter moment. ComponentBase._handle_with_hooks consults this
+    # flag.
+    _DEFER_RECORD_START = True
+
     def __init__(self, node: Node, ctx: ComponentContext | None = None) -> None:
         super().__init__(node, ctx)
         self._dma_read: simpy.Resource | None = None
@@ -80,9 +86,16 @@ class PeDmaComponent(PeEngineBase):
         path = self.ctx.router.find_path(self._pe_prefix, dst_node)
         drain_ns = self.ctx.compute_drain_ns(path, cmd.nbytes)
 
-        # Acquire DMA channel (command issue serialization)
+        # Acquire DMA channel — held through the entire round-trip so the
+        # channel models "one DMA in flight per PE per direction" rather
+        # than just issue-time serialization. This is what makes Option B
+        # meaningful: t_start = serve-start covers the actual transfer.
         with dma_res.request() as req:
             yield req
+            # Option B: record_start fires AFTER channel acquired, so t_start
+            # = serve-start (excludes queue wait). _DEFER_RECORD_START=True
+            # suppresses the auto-start in ComponentBase._handle_with_hooks.
+            self._on_process_start(env, cmd)
             # Create sub-Transaction with PeDmaMsg (HbmCtrl handles it directly)
             sub_done = env.event()
             sub_request = PeDmaMsg(
@@ -99,10 +112,8 @@ class PeDmaComponent(PeEngineBase):
             # Send to next hop (path[0] is pe_dma itself, path[1] is router)
             if len(path) > 1:
                 yield self.out_ports[path[1]].put(sub_txn.advance())
-        # DMA channel released after issue
-
-        # Wait for HBM transfer completion
-        yield sub_done
+            # Wait for HBM transfer completion BEFORE releasing the channel.
+            yield sub_done
         pe_txn.done.succeed()
 
     def _worker(self, env: simpy.Environment) -> Generator:
@@ -293,15 +304,17 @@ class PeDmaComponent(PeEngineBase):
         txn.done.succeed()
 
     def _pipeline_process(self, env: simpy.Environment, token: Any) -> Generator:
-        """Pipeline mode: DMA read/write via fabric, then self-route."""
-        self._on_process_start(env, token)
+        """Pipeline mode: DMA read/write via fabric, then self-route.
+
+        Option B: record_start is fired *inside* _do_pipeline_dma, after the
+        DMA channel is acquired — record_end stays here.
+        """
         yield from self._do_pipeline_dma(env, token)
         self._on_process_end(env, token)
 
         # Self-routing (handle same-component consecutive stages)
         next_stage = token.advance()
         while next_stage is not None and next_stage.component == self.node.id:
-            self._on_process_start(env, token)
             yield from self._do_pipeline_dma(env, token)
             self._on_process_end(env, token)
             next_stage = token.advance()
@@ -340,8 +353,13 @@ class PeDmaComponent(PeEngineBase):
             path = self.ctx.router.find_path(self._pe_prefix, dst_node)
             drain_ns = self.ctx.compute_drain_ns(path, nbytes)
 
+            # Hold dma_res through the full round-trip — one DMA in flight
+            # per PE per direction — so Option B's t_start (post-acquire)
+            # bounds the actual transfer interval.
             with dma_res.request() as req:
                 yield req
+                # Option B: t_start = post-acquire moment.
+                self._on_process_start(env, token)
                 sub_done = env.event()
                 sub_request = PeDmaMsg(
                     correlation_id="pipeline",
@@ -356,8 +374,11 @@ class PeDmaComponent(PeEngineBase):
                 )
                 if len(path) > 1:
                     yield self.out_ports[path[1]].put(sub_txn.advance())
-
-            yield sub_done
+                yield sub_done
+        else:
+            # No-op (nbytes==0 or no ctx): no channel wait, but still record
+            # so _on_process_end has a matching pending entry to finalise.
+            self._on_process_start(env, token)
 
     def _forward_txn(self, env: simpy.Environment, txn: Any) -> Generator:
         """Handle external Transaction (PeDmaMsg probe, M_CPU DMA) with channel acquisition."""
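The effect of moving the start timestamp to the post-acquire moment can be illustrated with a small stdlib-only model. This is a simplified sketch, not the simulator: it assumes a single FIFO channel held for the whole transfer and a fixed transfer time, and it only contrasts the two timestamp conventions (it does not reproduce the old early-release behaviour). `simulate_channel` and the record keys are hypothetical names.

```python
from typing import Dict, List


def simulate_channel(arrivals_ns: List[float], transfer_ns: float) -> List[Dict[str, float]]:
    """Model one DMA in flight at a time, served in arrival order."""
    records = []
    channel_free_at = 0.0
    for t_enqueue in arrivals_ns:
        # The request waits until the previous transfer has fully completed.
        t_acquire = max(t_enqueue, channel_free_at)
        t_end = t_acquire + transfer_ns
        channel_free_at = t_end
        records.append({
            "queue_wait": t_acquire - t_enqueue,
            # Start stamped at queue entry: duration absorbs the queue wait.
            "duration_from_enqueue": t_end - t_enqueue,
            # Start stamped after acquire: duration is the transfer interval.
            "duration_from_acquire": t_end - t_acquire,
        })
    return records


# Four back-to-back requests arriving together; values are illustrative.
records = simulate_channel([0.0, 0.0, 0.0, 0.0], transfer_ns=100.0)
for r in records:
    print(r)
```

In this model the enqueue-based duration grows with queue depth while the acquire-based duration stays at the transfer time, with the serialization showing up in `queue_wait` instead.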