Add CHANGES.md, README, update SPEC/ADRs for release 2

- CHANGES.md: detailed changelog for release 1 and 2
- README.md: full project docs with install, probe, run, test usage
- SPEC.md: add ADR-0014~0017 references, update R7 for pcie_ep endpoint
- ADR-0003: update NOC description to reference ADR-0017
- ADR-0004: add HBM efficiency factor (0.8) to BW guarantee contract
- ADR-0014: status Proposed -> Accepted
- ADR-0015: update D4 to M_CPU bypass for Memory R/W, add ADR-0016/0017 links
- ADR-0016 (new): IOChiplet NOC and memory data path
- ADR-0017 (new): Cube NOC 2D mesh architecture
- Fix MD lint warnings (unfenced code blocks) across all docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
+14
-14
@@ -7,7 +7,7 @@ Every request flows through a graph of **components** connected by **wires**.
The total latency reported is the **actual SimPy wall-clock** (`env.now` delta),
not a static formula—so contention and queueing are captured automatically.

```
```text
total_ns (actual) = wire_prop + component_overhead + drain + queueing
├── deterministic ──────────────────┘ │
└── contention-dependent ────────────────────┘
@@ -17,7 +17,7 @@ total_ns (actual) = wire_prop + component_overhead + drain + queueing
### 1. Wire Propagation

```
```text
wire_ns = distance_mm × ns_per_mm (global: 0.01 = 10 ps/mm)
```
@@ -29,7 +29,7 @@ and negligible compared to other costs.
### 2. Component Overhead (`overhead_ns`)

```
```text
component_ns = node.attrs["overhead_ns"]
```
@@ -53,7 +53,7 @@ This models arbitration, protocol processing, pipeline stages, etc.
### 3. Drain (Serialization Delay)

```
```text
drain_ns = nbytes / bottleneck_bw_gbs
```
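A minimal sketch of the drain formula in this hunk, assuming 1 GB/s means 10^9 bytes per second, so that bytes divided by GB/s comes out directly in nanoseconds. The function name `drain_ns` simply mirrors the variable in the formula and is not taken from the project's code.

```python
def drain_ns(nbytes: int, bottleneck_bw_gbs: float) -> float:
    """Serialization delay in ns; 1 GB/s is treated as 1 byte per ns."""
    if bottleneck_bw_gbs <= 0:
        raise ValueError("bottleneck bandwidth must be positive")
    return nbytes / bottleneck_bw_gbs


print(drain_ns(4096, 128))  # 32.0, the example quoted in the next hunk header
print(drain_ns(4096, 256))  # 16.0
print(drain_ns(64, 256))    # 0.25
```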
@@ -65,7 +65,7 @@ Example: 4096 bytes through a path with bottleneck 128 GB/s → `4096 / 128 = 32
### Formula (Theoretical Lower Bound)

```
```text
formula_ns = Σ(wire_prop) + Σ(overhead_ns) + drain_ns
```
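The lower bound can be sketched as a sum over the hops of a path. The `Hop` class and the wire lengths (2.5 mm and 6.0 mm) are hypothetical and chosen only for illustration; the 0.01 ns/mm wire delay, the 2.0 ns crossbar overhead, and the 256 GB/s bottleneck come from figures shown elsewhere in this diff.

```python
from dataclasses import dataclass

NS_PER_MM = 0.01  # global wire delay quoted in the doc: 10 ps/mm


@dataclass
class Hop:
    distance_mm: float  # wire length into this component (hypothetical)
    overhead_ns: float  # the component's overhead_ns attribute


def formula_ns(hops, nbytes, bottleneck_bw_gbs):
    """Contention-free latency: wire propagation + overheads + drain."""
    wire = sum(h.distance_mm * NS_PER_MM for h in hops)
    overhead = sum(h.overhead_ns for h in hops)
    drain = nbytes / bottleneck_bw_gbs
    return wire + overhead + drain


# Hypothetical two-hop path: crossbar (2.0 ns) then HBM controller (0 ns).
path = [Hop(distance_mm=2.5, overhead_ns=2.0), Hop(distance_mm=6.0, overhead_ns=0.0)]
print(round(formula_ns(path, 4096, 256.0), 3))  # 18.085
```

Any queueing observed in simulation is added on top of this value, which is why it acts as a lower bound.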
@@ -159,7 +159,7 @@ a timeout or waits on a resource/store. The delta between start and done capture

Each component is a SimPy process:

```
```text
_fan_in (per in_port) → _inbox (Store) → _worker → out_ports
```
@@ -215,7 +215,7 @@ If request A holds the resource and request B arrives:
- SimPy advances B's `env.now` by A's remaining service time
- This "extra" time shows up in B's `total_ns` automatically

```
```text
No contention: actual_ns == formula_ns
Contention: actual_ns > formula_ns
queueing_delay = actual_ns - formula_ns
@@ -237,7 +237,7 @@ with self._resource.request() as req:
This means a short request arriving during a long request's drain must wait
for the full remaining drain time—classic head-of-line blocking:

```
```text
Request A: 4 KB, drain = 16.0 ns (arrives at t=0)
Request B: 64 B, drain = 0.25 ns (arrives at t=5)
@@ -274,7 +274,7 @@ Setup: PE0 and PE1 in cube0 both read 4096 bytes from their local HBM slices

### Paths

```
```text
DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
DMA B: pe1.pe_dma → xbar.pe1 → hbm_ctrl.slice1
```
@@ -284,7 +284,7 @@ DMA B: pe1.pe_dma → xbar.pe1 → hbm_ctrl.slice1
Since slice0 and slice1 are **separate** hbm_ctrl instances, each with its own
`simpy.Resource(capacity=1)`, there is no resource competition.

```
```text
DMA A timeline:
t=0.00 pe_dma dequeues txn
t=0.00 xbar.pe0: overhead_ns=2.0 → t=2.00
@@ -304,13 +304,13 @@ Both complete at ~18.09 ns. `actual == formula` for both.

Now suppose both PE0 and PE1 read from **slice0**:

```
```text
DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
DMA B: pe1.pe_dma → xbar.pe1 → xbar.pe0 → hbm_ctrl.slice0
(chain traversal to reach slice0)
```
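When both transfers target the same slice they serialize on one capacity-1 resource. The sketch below is a plain-Python stand-in for that FIFO behavior, not the project's SimPy code; `serve_fifo` is a hypothetical helper. It reuses the Request A/B figures from the head-of-line example earlier in this diff (16.0 ns arriving at t=0, 0.25 ns arriving at t=5) rather than the DMA arrival times, which the truncated hunks do not show.

```python
def serve_fifo(requests):
    """Serve (name, arrival_ns, drain_ns) tuples one at a time, in arrival order.

    Returns {name: (done_ns, queueing_ns)}, where queueing_ns is the time
    spent waiting for the single server to become free.
    """
    free_at = 0.0
    results = {}
    for name, arrival, drain in sorted(requests, key=lambda r: r[1]):
        start = max(arrival, free_at)  # wait if the server is still busy
        done = start + drain
        results[name] = (done, start - arrival)
        free_at = done
    return results


result = serve_fifo([("A", 0.0, 16.0), ("B", 5.0, 0.25)])
print(result["A"])  # (16.0, 0.0)
print(result["B"])  # (16.25, 11.0)
```

Request B's 11.0 ns of waiting is the `queueing_delay = actual_ns - formula_ns` term: its own drain is only 0.25 ns, but it cannot start until A releases the resource.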
```
```text
DMA A timeline:
t=0.00 xbar.pe0(2.0) → wire → hbm_ctrl.slice0
t=2.025 yield req → immediate (first to arrive)
@@ -343,7 +343,7 @@ compare `actual_ns` vs `formula_ns` (available in PE DMA traces).

## Probe Output Explained

```
```text
=== PE DMA Latency ===
Case          Target             Actual  Ovhd  Drain  Wire  Ovhd%  Drain%  Eff.BW  BN.BW  Util%
pe-local-hbm  c0.pe0->c0.slice0   18.09   2.0   16.0  0.08  11.1%   88.5%  226.49  256.0  88.5%
@@ -368,7 +368,7 @@ pe-cross-half-hbm c0.pe0->c0.slice4 37.14 5.0 32.0 0.14 13.5% 86.1%
fraction. For small transfers (4KB), overhead is significant relative to drain.
For large transfers, drain dominates and utilization approaches 100%.

```
```text
4 KB: Ovhd=2.0, Drain=16.0 → Util=88.5% (overhead is 11% of time)
64 KB: Ovhd=2.0, Drain=256.0 → Util=99.2% (overhead is <1% of time)
```
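The two utilization figures can be reproduced with a short calculation. This sketch assumes that `Util%` is effective bandwidth divided by bottleneck bandwidth, and that the uncontended latency is overhead + drain + wire. It also assumes the 0.08 ns wire time from the probe table applies to the 64 KB case. The function name `util_pct` is illustrative only.

```python
def util_pct(nbytes, overhead_ns, drain_ns, wire_ns, bn_bw_gbs=256.0):
    """Bandwidth utilization in percent for an uncontended transfer."""
    actual_ns = overhead_ns + drain_ns + wire_ns
    eff_bw_gbs = nbytes / actual_ns  # bytes per ns, i.e. GB/s
    return 100.0 * eff_bw_gbs / bn_bw_gbs


print(round(util_pct(4096, 2.0, 16.0, 0.08), 1))    # 88.5
print(round(util_pct(65536, 2.0, 256.0, 0.08), 1))  # 99.2
```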