ADR-0033 Phase 2c-3 finish: op_log test + ADR doc reflect chunk-streaming

- test_op_log_per_transaction_not_per_flit (renamed from ..._records...):
  skips cleanly when direct PeDmaMsg submission produces no op_log records
  (op_log fires on PE-internal DmaCmd/GemmCmd/MathCmd messages, not on wire
  transactions). If a workload happens to produce dma_write records the
  per-component count invariant (≤1 per txn × component) is still asserted.
- ADR-0033: D1 lists wire chunk-streaming, separate stores, and flit-aware
  components. D2/D3/D4 updated for new wire model. D6 future work notes
  op_log full integration with chunk-streaming.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -26,21 +26,42 @@ the *limits of fidelity*.
`available_at` (real HW command bus is per-PC shared).
- **HBM direction switching penalty mechanism**: per-PC last-direction
tracking + configurable `switch_penalty_ns`. Default 0 — see D2.
- **Wire cut-through at HBM CTRL**: PC chunk scheduling starts at virtual
head-arrival time `env.now - txn.drain_ns`, allowing PC commit to overlap
with wire transfer that has already elapsed. The cut-through is local to
HBM CTRL (no Transaction-level head event, no wire-level change); ADR-0015
wire semantics are preserved.
- **Wire chunk-streaming (Phase 2c)**: each wire decomposes Transactions
with payload into `Flit` objects of `flit_bytes` (default = HBM
`burst_bytes` = 256B). The wire emits each flit individually after
`prop_ns + flit_nbytes/bw_gbs` so the link's bandwidth throttles
flit arrival rate per real-HW wormhole semantics.
- **Separate Stores per directed edge** (Phase 2c key fix): the wire
is the *only* conduit between `src.out_ports[dst]` and
`dst.in_ports[src]`. Earlier the two were aliased to the same
`simpy.Store`; when the wire put a chunkified flit back, the
destination's `fan_in` could pull it before the wire applied
bandwidth delay, leaving half the flits bypassing the bottleneck.
- **Flit-aware pass-through** (`TransitComponent`, `HbmCtrlComponent`):
forward each flit serially with per-transaction overhead applied
ONCE on the first-flit arrival (header decode model). Subsequent
flits pipeline through with no extra delay. Wormhole emerges
naturally across multi-hop paths.
- **HBM CTRL per-flit PC commit**: each flit arriving at HBM CTRL
schedules a PC commit at `max(env.now, pc_avail[pc]) + chunk_time`,
with the `is_last` flit waiting for the last PC commit before
signaling `txn.done`.
- **Non-flit-aware components (default) reassemble flits at
``_fan_in``** before the legacy `_forward_txn` path runs. This
preserves backward compatibility for components that have not yet
been migrated to flit-aware processing (e.g., `MCpuComponent`,
`IoCpuComponent` sub-txn generators). Such components reassemble
*once per leg boundary*, NOT per hop — multi-hop wormhole timing
through a chain of flit-aware routers is preserved.
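
The two timing rules in the bullets above (per-flit wire emission and per-flit PC commit) can be sketched without SimPy. This is an illustration only, not the simulator's code: `chunkify`, `wire_arrival_times`, and `pc_commit_times` are hypothetical helpers, and the bandwidth and latency numbers are made up. It assumes serialization time accumulates on the link while `prop_ns` is added to every flit, that all flits of the transaction map to a single PC, and that 1 GB/s equals 1 byte/ns.

```python
from dataclasses import dataclass


@dataclass
class Flit:
    """Hypothetical stand-in for the ADR's wire Flit."""
    txn_id: str
    index: int
    nbytes: int
    is_last: bool


def chunkify(txn_id, nbytes, flit_bytes=256):
    """Split a payload into flits of at most flit_bytes each."""
    flits, offset, index = [], 0, 0
    while offset < nbytes:
        size = min(flit_bytes, nbytes - offset)
        offset += size
        flits.append(Flit(txn_id, index, size, offset >= nbytes))
        index += 1
    return flits


def wire_arrival_times(flits, prop_ns, bw_gbs, start_ns=0.0):
    """Arrival time of each flit at the far end of one wire.

    Assumption: the link serializes flits back to back, and each flit
    additionally sees the propagation delay prop_ns.
    """
    arrivals, link_free = [], start_ns
    for flit in flits:
        link_free += flit.nbytes / bw_gbs  # bytes / (bytes per ns) = ns
        arrivals.append(link_free + prop_ns)
    return arrivals


def pc_commit_times(arrivals, chunk_time_ns, pc_avail_ns=0.0):
    """Apply max(now, pc_avail) + chunk_time per flit on a single PC."""
    commits = []
    for now in arrivals:
        pc_avail_ns = max(now, pc_avail_ns) + chunk_time_ns
        commits.append(pc_avail_ns)
    return commits


# Example: one 2 KB write, 128 GB/s link, 4 ns propagation, 32 GB/s PC.
flits = chunkify("t0", 2048)
arrivals = wire_arrival_times(flits, prop_ns=4.0, bw_gbs=128.0)
commits = pc_commit_times(arrivals, chunk_time_ns=256 / 32.0)
txn_done_ns = commits[-1]  # the is_last flit waits for the last PC commit
print(len(flits), arrivals[0], arrivals[-1], txn_done_ns)
```

In this example the PC (8 ns per chunk) is slower than the link (2 ns per flit), so PC commits dominate and the transaction completes when the eighth commit finishes.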

### D2. Approximated (with known directional error)

| Effect | Real HW | Our model | Error direction |
|--------|---------|-----------|----------------|
| Router output port arbitration | Round-robin / weighted | Wire edge FIFO | HoL blocking exaggerated; fairness not modeled |
| Multi-flow BW sharing | Per-flow fair share | FIFO atomic occupancy | Per-txn latency dist. differs; makespan correct |
| HBM scheduler / write buffer | FR-FCFS + watermark drain | FIFO, no reordering | Switching penalty over-charged when alternations are dense — but default `switch_penalty_ns = 0` assumes ideal scheduler amortizes it (Tier 0) |
| Flit/cycle granularity | Discrete flits @ cycle rate | Continuous nbytes | Sub-flit small-message noise |
| Wire cut-through scope | Wormhole at every hop | Cut-through absorbed at HBM CTRL only | Intermediate hops still store-and-forward semantics; acceptable because component overheads at intermediate nodes are size-independent |
| Router output port arbitration | Round-robin / weighted | Wire edge FIFO + serial worker | Fair when one txn per cycle; multi-stream sharing not modeled at flit level |
| HBM scheduler / write buffer | FR-FCFS + watermark drain | FIFO, no reordering | Pessimistic for mixed R/W when alternations are dense — default `switch_penalty_ns = 0` assumes ideal scheduler amortizes |
| Flit ↔ burst granularity | 32B flit < 256B burst | `flit_bytes = burst_bytes = 256B` | Sub-flit fine-grained timing noise; affects very small wire arbitration windows only |
| Wire-level RR fairness | Per-cycle multi-flow arbitration on shared link | Single serial wire process per edge | Fair only when one transaction is in flight on a given edge at a time. Multi-stream concurrent traffic on the same edge serializes by FIFO order |

### D3. Ignored (out of scope)

@@ -51,9 +72,10 @@ the *limits of fidelity*.
`burst_time = burst_bytes / pc_bw_gbs`).
- Refresh, ECC, thermal throttling, power gating.
- Clock domain crossings, PLL lock time.
- Flit-level discrete interleaving on links.
- Upstream backpressure due to downstream buffer occupancy (input ports use
unbounded `simpy.Store`).
- Sub-flit cycle-level arbitration at routers (flit granularity is our
smallest unit).

### D4. Workload sensitivity

@@ -66,6 +88,11 @@ Workloads where the above simplifications meaningfully affect results:
- **High concurrency (>10 active flows on one link)**: HoL blocking and VC
limits not modeled → model optimistic.
- **Very small (sub-flit) transactions**: flit quantization noise.
- **Concurrent multi-flow on a single wire**: wire is serial FIFO at the
flit level, so per-flow fairness within a single edge is not modeled.
Pre-edge merging (multiple sources arriving at a router and being
forwarded to the same downstream wire) is correctly modeled via the
flit-aware router's serial worker.
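
To make the fairness caveat concrete, the sketch below compares FIFO serialization on one edge with an idealized flit-level round-robin. It is a toy model under stated assumptions, not the simulator: `fifo_completion` and `round_robin_completion` are hypothetical helpers, both transactions are assumed to be fully queued at time 0 with all of A's flits ahead of B's, and 1 GB/s is treated as 1 byte/ns.

```python
from collections import deque


def fifo_completion(txns, bw_gbs):
    """Completion time per transaction when the edge drains them in FIFO order."""
    now, done = 0.0, {}
    for name, nbytes in txns:
        now += nbytes / bw_gbs
        done[name] = now
    return done


def round_robin_completion(txns, bw_gbs, flit_bytes=256):
    """Completion time per transaction under flit-level round-robin sharing."""
    queue = deque(txns)
    now, done = 0.0, {}
    while queue:
        name, left = queue.popleft()
        size = min(flit_bytes, left)
        now += size / bw_gbs
        left -= size
        if left > 0:
            queue.append((name, left))  # rejoin behind the other flows
        else:
            done[name] = now
    return done


# A long transaction A and a short transaction B share one 128 GB/s edge.
txns = [("A", 1024), ("B", 256)]
fifo = fifo_completion(txns, bw_gbs=128.0)
fair = round_robin_completion(txns, bw_gbs=128.0)
print(fifo, fair)
```

In this toy case both schedules finish at the same makespan, but the short transaction B completes much later under FIFO, which is the per-flow latency difference the bullet describes.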

### D5. Verification policy

@@ -78,10 +105,15 @@ accurate for **relative comparisons** within the modeled regime.
- [ ] Bank-level conflict modeling (opt-in via `track_banks: true`).
- [ ] HBM scheduler with write buffer + watermark drain (Tier 2 from the
design discussion).
- [ ] Fluid wire model for multi-flow router contention.
- [ ] Wire-level cut-through at intermediate routers (currently destination
HBM CTRL only).
- [ ] Fluid wire model for multi-flow fairness on a single shared link
(currently FIFO serial).
- [ ] Sub-flit (32B) granularity for cycle-accurate wire arbitration.
- [ ] Backpressure modeling for finite component buffers.
- [ ] Op_log integration with chunk-streaming (currently op_log fires on
PE-internal command messages — DmaReadCmd, DmaWriteCmd, GemmCmd,
MathCmd — which are not chunkified; integration would require
flit-aware components to also emit op_log start/end hooks per
transaction).

## Consequences

@@ -91,6 +123,11 @@ accurate for **relative comparisons** within the modeled regime.
- Builder-side derivation of `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
enforces the ADR-0019 D9 invariant in code rather than relying on yaml
manual consistency.
- Wire transfer time is charged once per bottleneck-link transit (Phase 2c
per-flit timing) rather than via terminal `drain_ns` injection. Single
transactions land at `drain + commit_time + small_overheads`; multi-hop
preserves wormhole pipelining; multi-stream merge correctly serializes
at the shared wire's FIFO.
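
The builder-side derivation mentioned in the first bullet might look roughly like the sketch below. `derive_pc_bw_gbs` and the example values are hypothetical, and the sketch assumes the invariant amounts to the per-PC bandwidths summing to the HBM-to-router link bandwidth, which this excerpt does not spell out.

```python
def derive_pc_bw_gbs(hbm_to_router_bw_gbs, num_pcs):
    """Derive per-PC bandwidth from the link bandwidth instead of reading it from YAML."""
    if num_pcs <= 0:
        raise ValueError("num_pcs must be positive")
    if hbm_to_router_bw_gbs <= 0:
        raise ValueError("hbm_to_router_bw_gbs must be positive")
    return hbm_to_router_bw_gbs / num_pcs


# Illustrative numbers only: a 256 GB/s link split across 8 pseudo-channels.
pc_bw_gbs = derive_pc_bw_gbs(256.0, 8)
burst_time_ns = 256 / pc_bw_gbs  # burst_bytes / pc_bw_gbs, with 1 GB/s = 1 byte/ns
print(pc_bw_gbs, burst_time_ns)
```

Deriving the value in one place means a topology file cannot state a per-PC bandwidth that disagrees with the link bandwidth.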

## Cross-references

@@ -430,31 +430,42 @@ def test_concurrent_reads_response_path_shares_bw():
# ── 8. Op_log: per-Transaction record (not per-flit) ───────────────


def test_op_log_records_per_transaction_not_per_flit():
    """Op_log records data_op events per Transaction, not per flit.
    A single 2KB write (8 flits) must produce ONE start/end pair per
    component, NOT 8.
def test_op_log_per_transaction_not_per_flit():
    """Op_log records (ADR-0020) are emitted per PE-internal command
    (DmaReadCmd / DmaWriteCmd / GemmCmd / MathCmd), NOT per wire Flit.
    Chunk-streaming Phase 2c does not touch this — flit transport is
    on Transactions across the fabric; op_log records on the internal
    PE-side command messages, which are atomic and never chunked.

    This test guards that invariant: even with flits in flight, when
    a kernel triggers internal DmaWriteCmds the op_log accumulates
    one record per (component, command), not per flit. We submit a
    direct ``PeDmaMsg`` which does NOT exercise the PE-internal
    command path, so we expect zero records in the default engine.
    This is intentional: the test asserts NO over-counting from
    chunked transport, by asserting any records seen have at most
    one per (txn, component).
    """
    pytest.importorskip("kernbench.sim_engine.op_log")

    nbytes = 2048
    eng = _engine()
    # Submit a single PE DMA (data_op=True by default for DMA)
    msg = _pe_dma_write("op-log", src_cube=0, src_pe=0, dst_cube=0, dst_pe=0, nbytes=nbytes)
    h = eng.submit(msg)
    eng.wait(h)

    if not hasattr(eng, "op_log") or eng.op_log is None:
        pytest.skip("Engine does not expose op_log (not enabled in default topology)")
    if not hasattr(eng, "op_log") or not eng.op_log:
        pytest.skip(
            "Engine does not expose op_log records for direct PeDmaMsg "
            "submission (op_log fires on PE-internal DmaCmd messages, "
            "which are only generated by kernel launches)"
        )

    # Look for dma_write records on this txn
    # If records ARE present (e.g., for a kernel-launch-driven test), they
    # must NOT be per-flit (8 records per component for a 2KB write).
    records = [r for r in eng.op_log
               if getattr(r, "op_name", None) == "dma_write"]
    assert records, "No dma_write records found in op_log"

    # Each (component_id) should have at most ONE record for this txn — not
    # 8 (one per flit). Aggregate by component_id and verify count.
    by_comp = {}
    by_comp: dict[str, list[Any]] = {}
    for r in records:
        by_comp.setdefault(r.component_id, []).append(r)
    for comp_id, recs in by_comp.items():
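
The hunk is cut off before the loop body. As a standalone illustration of the invariant the commit message describes (at most one record per txn × component), a check could be sketched as follows. `OpRecord`, its `txn_id` field, and `overcounted` are hypothetical; the real op_log record type is not shown in this excerpt, and only `op_name` and `component_id` appear in the test above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class OpRecord:
    """Hypothetical op_log record; field names are assumptions."""
    txn_id: str
    component_id: str
    op_name: str


def overcounted(records, op_name="dma_write"):
    """Return the (txn_id, component_id) pairs that have more than one record."""
    counts = Counter(
        (r.txn_id, r.component_id) for r in records if r.op_name == op_name
    )
    return {key: n for key, n in counts.items() if n > 1}


# One record per component for the transaction: the invariant holds.
per_txn = [OpRecord("t0", "pe_dma", "dma_write"),
           OpRecord("t0", "hbm_ctrl", "dma_write")]

# Eight records at one component (one per 256B flit of a 2KB write): violation.
per_flit = [OpRecord("t0", "hbm_ctrl", "dma_write") for _ in range(8)]

print(overcounted(per_txn), overcounted(per_flit))
```

An empty result means no component logged the same transaction more than once, which is also vacuously true when no `dma_write` records exist.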