ADR-0033 D6: reorder future work by workload impact

Cycle-accurate arbitration policies (priority/iSLIP) downgraded to "academic / specific use cases": FIFO inbox is approximately fair for typical similar-rate workloads (GEMM, AllReduce, data parallel). True impact appears only for QoS modeling or per-stream tail latency analysis under saturation.

Higher-priority items pulled forward: address-based PC selection at HBM CTRL (directly affects multi-PE concurrent HBM contention), bank conflict modeling, HBM scheduler, finite buffer backpressure, op_log chunk-streaming integration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@@ -106,36 +106,47 @@ Note: multi-stream merging at routers IS modeled correctly — each
 in_port has its own fan_in process, all push to a shared inbox, and
 the router worker forwards in inbox FIFO order. Flits from different
 upstream streams naturally interleave at flit granularity. The items
-below are different concerns.
+below are different concerns, ordered by expected workload impact.
 
+**Higher impact (workload accuracy gap)**:
+
-- [ ] **Cycle-accurate router arbitration policies** (RR with
-priorities, age, iSLIP). Currently the inbox FIFO order is used as
-a proxy for fair RR — works when flit arrival times differ slightly
-between streams, but doesn't reflect intentional priority/QoS.
-- [ ] **Sub-flit (32B) granularity** for finer wire arbitration
-cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
-per 32B flit. Effect is small for most workloads (sub-flit timing
-noise).
 - [ ] **Address-based PC selection at HBM CTRL** (replace the
 address-blind global round-robin). When two transactions of size
 `num_pcs × burst_bytes` (e.g., 2KB at 8 PCs × 256B) arrive
 concurrently, both claim PCs 0..7 via global RR, producing full
-per-PC contention. Real HW uses address bits to select PCs, so
-different-address transactions hit different PC patterns. Address
-modeling would let the simulator reflect cache-line/page-aware
-layouts.
+per-PC contention even when real-HW address striping would put
+them on disjoint PC sets. Directly affects multi-PE concurrent
+HBM workload latencies.
 - [ ] **Bank-level conflict modeling** within a PC (opt-in via
-`track_banks: true`). Currently we assume no same-bank reuse.
+`track_banks: true`). Currently we assume no same-bank reuse;
+random scatter/gather workloads are optimistic here.
 - [ ] **HBM scheduler** with write buffer + watermark drain (Tier 2
 from the design discussion). Default `switch_penalty_ns=0` is the
-ideal-amortization stand-in.
-- [ ] **Backpressure** modeling for finite component buffers.
+ideal-amortization stand-in; bursty mixed R/W workloads benefit
+from explicit modeling.
+- [ ] **Backpressure** modeling for finite component buffers. Matters
+at high concurrency / sustained saturation where buffer occupancy
+causes upstream stalls.
 - [ ] **Op_log integration with chunk-streaming**: currently op_log
 fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd,
 GemmCmd, MathCmd) which are not chunkified. Integration would
 require flit-aware components to also emit op_log start/end hooks
 per transaction (start on first flit, end on is_last).
 
+**Lower impact (academic / specific use cases)**:
+
+- [ ] **Cycle-accurate router arbitration policies** (RR with
+priorities, age, iSLIP). The FIFO inbox is already approximately
+fair when flit arrival times differ slightly between streams (the
+common case for similar-rate workloads). True impact appears only
+for: (a) priority/QoS modeling, (b) per-stream tail latency
+analysis under sustained saturation. Not critical for makespan or
+average-latency studies.
+- [ ] **Sub-flit (32B) granularity** for finer wire arbitration
+cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
+per 32B flit. Effect is small for most workloads (sub-flit timing
+noise on small messages).
+
 ## Consequences
 
 - Single review point for all model fidelity questions. Each future PR
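
As an illustration of the "Address-based PC selection at HBM CTRL" item in the hunk above, here is a minimal Python sketch (not the simulator's code) comparing which PCs two concurrent 2KB transactions touch under an address-blind global round-robin and under an address-derived mapping. `global_rr_pcs`, `address_pcs`, and `stripe_bytes` are hypothetical names; the 512B interleave granularity in particular is an assumption chosen for the example, since the ADR does not specify the real address mapping.

```python
from itertools import count

NUM_PCS = 8        # pseudo-channels, matching the ADR's "8 PCs" example
BURST_BYTES = 256  # burst size, matching the ADR's "256B" example


def global_rr_pcs(txn_sizes):
    """Address-blind: one shared round-robin counter assigns a PC per burst."""
    rr = count()
    return [
        {next(rr) % NUM_PCS for _ in range(size // BURST_BYTES)}
        for size in txn_sizes
    ]


def address_pcs(txns, stripe_bytes):
    """Address-based: PC index = (burst address // stripe_bytes) % NUM_PCS.

    stripe_bytes is a hypothetical interleave granularity (assumption).
    """
    return [
        {((base + off) // stripe_bytes) % NUM_PCS
         for off in range(0, size, BURST_BYTES)}
        for base, size in txns
    ]


# Two concurrent 2KB transactions at different base addresses.
txns = [(0x0000, 2048), (0x0800, 2048)]

rr_sets = global_rr_pcs([size for _, size in txns])
addr_sets = address_pcs(txns, stripe_bytes=512)

print("global RR    :", [sorted(s) for s in rr_sets])
print("address-based:", [sorted(s) for s in addr_sets])
```

Under these assumptions the global round-robin puts both transactions on PCs 0..7, while the address-derived mapping splits them across PCs 0..3 and 4..7. With `stripe_bytes` equal to `BURST_BYTES`, both transactions would again cover all eight PCs, so the size of the accuracy gap depends on the address mapping actually modeled.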