Add probe CLI improvements, D2H read, UCIe/HBM tuning, BW sweep
- Probe CLI: restructured output (tables first, routes below), per-hop timestamps, split cross-cube into best/worst cases, D2H read section
- UCIe overhead: 1ns -> 8ns per port (16ns per crossing) to fix cross-cube-best < cross-half latency inversion
- HBM efficiency: added efficiency=0.8 factor to hbm_ctrl, reducing effective BW from 256 to 204.8 GB/s
- Multi-size BW sweep: saturation tables (4KB-1MB) for all probe cases
- Probe default data size: 4KB -> 32KB for more realistic measurements
- IOChiplet NOC + D2H topology and tests
- NOC mesh, xbar, BW occupancy components and tests
- Cube mesh visualization diagram

278 tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
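The two tuning changes in the message reduce to simple arithmetic, sketched below. This is a minimal illustration and not the simulator's code: the constant and function names are hypothetical, and it assumes that GB/s is decimal (1 GB/s = 1 byte/ns) and that 32KB means 32768 bytes.

```python
# Hypothetical constants mirroring the numbers quoted in the commit message.
UCIE_PORT_OVERHEAD_NS = 8.0   # per port (TX or RX); was 1.0 before this commit
HBM_PEAK_BW_GBPS = 256.0      # edge bandwidth before the efficiency factor
HBM_EFFICIENCY = 0.8          # efficiency factor added to hbm_ctrl


def ucie_crossing_overhead_ns(port_overhead_ns: float = UCIE_PORT_OVERHEAD_NS) -> float:
    """A die-to-die crossing passes one TX port and one RX port."""
    return 2 * port_overhead_ns


def hbm_effective_bw_gbps(peak_gbps: float = HBM_PEAK_BW_GBPS,
                          efficiency: float = HBM_EFFICIENCY) -> float:
    """Effective bandwidth after scaling the peak by the efficiency factor."""
    return peak_gbps * efficiency


def drain_time_ns(size_bytes: int, bw_gbps: float) -> float:
    """Time to move size_bytes at bw_gbps, assuming 1 GB/s == 1 byte/ns."""
    return size_bytes / bw_gbps


if __name__ == "__main__":
    print(ucie_crossing_overhead_ns())                        # ns per crossing
    print(hbm_effective_bw_gbps())                            # GB/s
    print(drain_time_ns(32 * 1024, hbm_effective_bw_gbps()))  # ns for 32KB
```

Under these assumptions, a 32KB transfer takes roughly 160 ns at the derated bandwidth versus 128 ns at the 256 GB/s peak.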
@@ -44,9 +44,9 @@ This models arbitration, protocol processing, pipeline stages, etc.
 | fabric switch | 5.0 | Packet arbitration |
 | xbar | 2.0 | Crossbar arbitration |
 | xbar bridge | 1.0 | Bridge traversal between xbar halves |
-| ucie | 1.0 | UCIe protocol overhead per port |
+| ucie | 8.0 | UCIe protocol overhead per port (TX or RX; 16ns per crossing) |
 | noc (2D mesh) | 0.0 | Hop delay modeled internally via manhattan distance |
-| hbm_ctrl | 0.0 | Access time captured in drain_ns |
+| hbm_ctrl | 0.0 | Access time via drain_ns; efficiency=0.8 reduces edge BW (256→204.8) |
 | pe_cpu | 2.0 | Command dispatch |
 | pe_scheduler | 1.0 | PE-internal scheduling |
 | pe_gemm/math | 0.0 | Placeholder; will use flops-based model |
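To show how the per-component overheads in this table could add up along a path, here is a rough sketch that sums them for a route. The `OVERHEAD_NS` dictionary, the `route_overhead_ns` helper, and the example route are all hypothetical; the real probe routes and any NoC hop or drain-time terms are not modeled here.

```python
# Per-component overheads (ns) copied from the updated table; the dict itself
# is a hypothetical illustration, not the simulator's data structure.
OVERHEAD_NS = {
    "fabric switch": 5.0,
    "xbar": 2.0,
    "xbar bridge": 1.0,
    "ucie": 8.0,        # per port; a crossing visits two ucie ports
    "noc": 0.0,         # hop delay is modeled separately
    "hbm_ctrl": 0.0,    # access time is captured in drain_ns
    "pe_cpu": 2.0,
    "pe_scheduler": 1.0,
    "pe_gemm/math": 0.0,
}


def route_overhead_ns(route, overheads=OVERHEAD_NS):
    """Sum the fixed overhead of every component visited along a route."""
    return sum(overheads[kind] for kind in route)


# Made-up cross-cube route: one UCIe crossing means a TX port and an RX port.
example_route = ["pe_cpu", "xbar", "ucie", "ucie", "xbar", "hbm_ctrl"]

if __name__ == "__main__":
    old = dict(OVERHEAD_NS, ucie=1.0)  # value before this commit
    print(route_overhead_ns(example_route, old))  # fixed overhead before
    print(route_overhead_ns(example_route))       # fixed overhead after
```

For this made-up route the fixed overhead goes from 8 ns to 22 ns, with the whole difference coming from the two UCIe ports.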