ADR-0033 D6: reorder future work by workload impact
Cycle-accurate arbitration policies (priority/iSLIP) downgraded to "academic / specific use cases": the FIFO inbox is approximately fair for typical similar-rate workloads (GEMM, AllReduce, data parallel). True impact appears only for QoS modeling or per-stream tail latency analysis under saturation.

Higher-priority items pulled forward: address-based PC selection at HBM CTRL (directly affects multi-PE concurrent HBM contention), bank conflict modeling, HBM scheduler, finite buffer backpressure, op_log chunk-streaming integration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -106,36 +106,47 @@ Note: multi-stream merging at routers IS modeled correctly — each
 in_port has its own fan_in process, all push to a shared inbox, and
 the router worker forwards in inbox FIFO order. Flits from different
 upstream streams naturally interleave at flit granularity. The items
-below are different concerns.
+below are different concerns, ordered by expected workload impact.
 
-- [ ] **Cycle-accurate router arbitration policies** (RR with
-priorities, age, iSLIP). Currently the inbox FIFO order is used as
-a proxy for fair RR — works when flit arrival times differ slightly
-between streams, but doesn't reflect intentional priority/QoS.
-- [ ] **Sub-flit (32B) granularity** for finer wire arbitration
-cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
-per 32B flit. Effect is small for most workloads (sub-flit timing
-noise).
+**Higher impact (workload accuracy gap)**:
+
 - [ ] **Address-based PC selection at HBM CTRL** (replace the
 address-blind global round-robin). When two transactions of size
 `num_pcs × burst_bytes` (e.g., 2KB at 8 PCs × 256B) arrive
 concurrently, both claim PCs 0..7 via global RR, producing full
-per-PC contention. Real HW uses address bits to select PCs, so
-different-address transactions hit different PC patterns. Address
-modeling would let the simulator reflect cache-line/page-aware
-layouts.
+per-PC contention even when real-HW address striping would put
+them on disjoint PC sets. Directly affects multi-PE concurrent
+HBM workload latencies.
 - [ ] **Bank-level conflict modeling** within a PC (opt-in via
-`track_banks: true`). Currently we assume no same-bank reuse.
+`track_banks: true`). Currently we assume no same-bank reuse;
+random scatter/gather workloads are optimistic here.
 - [ ] **HBM scheduler** with write buffer + watermark drain (Tier 2
 from the design discussion). Default `switch_penalty_ns=0` is the
-ideal-amortization stand-in.
-- [ ] **Backpressure** modeling for finite component buffers.
+ideal-amortization stand-in; bursty mixed R/W workloads benefit
+from explicit modeling.
+- [ ] **Backpressure** modeling for finite component buffers. Matters
+at high concurrency / sustained saturation where buffer occupancy
+causes upstream stalls.
 - [ ] **Op_log integration with chunk-streaming**: currently op_log
 fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd,
 GemmCmd, MathCmd) which are not chunkified. Integration would
 require flit-aware components to also emit op_log start/end hooks
 per transaction (start on first flit, end on is_last).
 
+**Lower impact (academic / specific use cases)**:
+
+- [ ] **Cycle-accurate router arbitration policies** (RR with
+priorities, age, iSLIP). The FIFO inbox is already approximately
+fair when flit arrival times differ slightly between streams (the
+common case for similar-rate workloads). True impact appears only
+for: (a) priority/QoS modeling, (b) per-stream tail latency
+analysis under sustained saturation. Not critical for makespan or
+average-latency studies.
+- [ ] **Sub-flit (32B) granularity** for finer wire arbitration
+cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
+per 32B flit. Effect is small for most workloads (sub-flit timing
+noise on small messages).
+
 ## Consequences
 
 - Single review point for all model fidelity questions. Each future PR
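Illustrative note on the address-based PC selection item in the diff: the sketch below contrasts an address-blind global round-robin with a simple address-striped mapping. It is a toy model, not the simulator's code. The function names, the shared-counter form of the round-robin, and `STRIPE_BYTES = 512` are assumptions made only for this example; the ADR does not say which address bits real hardware uses to pick a PC.

```python
from itertools import count

NUM_PCS = 8          # pseudo-channels, as in the ADR's 8 PC example
BURST_BYTES = 256    # burst size, as in the ADR's 256B example
STRIPE_BYTES = 512   # hypothetical stripe width, chosen for illustration


def burst_addresses(base_addr, size_bytes, burst_bytes=BURST_BYTES):
    """Start address of every burst in one transaction."""
    return range(base_addr, base_addr + size_bytes, burst_bytes)


def pcs_global_rr(transactions, num_pcs=NUM_PCS):
    """Address-blind: one shared counter hands out PCs burst by burst."""
    counter = count()
    return [
        {next(counter) % num_pcs for _ in burst_addresses(base, size)}
        for base, size in transactions
    ]


def pcs_address_striped(transactions, num_pcs=NUM_PCS,
                        stripe_bytes=STRIPE_BYTES):
    """Address-based: the PC is derived from the burst's address."""
    return [
        {(addr // stripe_bytes) % num_pcs
         for addr in burst_addresses(base, size)}
        for base, size in transactions
    ]


# Two concurrent 2KB transactions (num_pcs x burst_bytes) at different bases.
txns = [(0, 2048), (2048, 2048)]
rr_sets = pcs_global_rr(txns)
striped_sets = pcs_address_striped(txns)

print("global RR overlap:", sorted(rr_sets[0] & rr_sets[1]))
print("striped overlap:  ", sorted(striped_sets[0] & striped_sets[1]))
```

Under these assumptions the round-robin puts both transactions on all eight PCs, while the striped mapping places them on PCs 0..3 and 4..7. With a stripe equal to the burst size, both transactions would again cover every PC, so the size of the gap depends on the actual hardware mapping.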