Add CHANGES.md, README, update SPEC/ADRs for release 2

- CHANGES.md: detailed changelog for releases 1 and 2
- README.md: full project docs with install, probe, run, test usage
- SPEC.md: add ADR-0014~0017 references, update R7 for pcie_ep endpoint
- ADR-0003: update NOC description to reference ADR-0017
- ADR-0004: add HBM efficiency factor (0.8) to BW guarantee contract
- ADR-0014: status Proposed -> Accepted
- ADR-0015: update D4 to M_CPU bypass for Memory R/W, add ADR-0016/0017 links
- ADR-0016 (new): IOChiplet NOC and memory data path
- ADR-0017 (new): Cube NOC 2D mesh architecture
- Fix MD lint warnings (unfenced code blocks) across all docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 01:43:15 -07:00
parent d75da439c6
commit fc6abbc8ee
10 changed files with 613 additions and 65 deletions
+14 -14
@@ -7,7 +7,7 @@ Every request flows through a graph of **components** connected by **wires**.
The total latency reported is the **actual SimPy wall-clock** (`env.now` delta),
not a static formula—so contention and queueing are captured automatically.
```
```text
total_ns (actual) = wire_prop + component_overhead + drain + queueing
├── deterministic ──────────────────┘ │
└── contention-dependent ────────────────────┘
@@ -17,7 +17,7 @@ total_ns (actual) = wire_prop + component_overhead + drain + queueing
### 1. Wire Propagation
```
```text
wire_ns = distance_mm × ns_per_mm (global: 0.01 = 10 ps/mm)
```
@@ -29,7 +29,7 @@ and negligible compared to other costs.
### 2. Component Overhead (`overhead_ns`)
```
```text
component_ns = node.attrs["overhead_ns"]
```
@@ -53,7 +53,7 @@ This models arbitration, protocol processing, pipeline stages, etc.
### 3. Drain (Serialization Delay)
```
```text
drain_ns = nbytes / bottleneck_bw_gbs
```
@@ -65,7 +65,7 @@ Example: 4096 bytes through a path with bottleneck 128 GB/s → `4096 / 128 = 32
### Formula (Theoretical Lower Bound)
```
```text
formula_ns = Σ(wire_prop) + Σ(overhead_ns) + drain_ns
```
@@ -159,7 +159,7 @@ a timeout or waits on a resource/store. The delta between start and done capture
Each component is a SimPy process:
```
```text
_fan_in (per in_port) → _inbox (Store) → _worker → out_ports
```
@@ -215,7 +215,7 @@ If request A holds the resource and request B arrives:
- SimPy advances B's `env.now` by A's remaining service time
- This "extra" time shows up in B's `total_ns` automatically
```
```text
No contention: actual_ns == formula_ns
Contention: actual_ns > formula_ns
queueing_delay = actual_ns - formula_ns
@@ -237,7 +237,7 @@ with self._resource.request() as req:
This means a short request arriving during a long request's drain must wait
for the full remaining drain time—classic head-of-line blocking:
```
```text
Request A: 4 KB, drain = 16.0 ns (arrives at t=0)
Request B: 64 B, drain = 0.25 ns (arrives at t=5)
@@ -274,7 +274,7 @@ Setup: PE0 and PE1 in cube0 both read 4096 bytes from their local HBM slices
### Paths
```
```text
DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
DMA B: pe1.pe_dma → xbar.pe1 → hbm_ctrl.slice1
```
@@ -284,7 +284,7 @@ DMA B: pe1.pe_dma → xbar.pe1 → hbm_ctrl.slice1
Since slice0 and slice1 are **separate** hbm_ctrl instances, each with its own
`simpy.Resource(capacity=1)`, there is no resource competition.
```
```text
DMA A timeline:
t=0.00 pe_dma dequeues txn
t=0.00 xbar.pe0: overhead_ns=2.0 → t=2.00
@@ -304,13 +304,13 @@ Both complete at ~18.09 ns. `actual == formula` for both.
Now suppose both PE0 and PE1 read from **slice0**:
```
```text
DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
DMA B: pe1.pe_dma → xbar.pe1 → xbar.pe0 → hbm_ctrl.slice0
(chain traversal to reach slice0)
```
```
```text
DMA A timeline:
t=0.00 xbar.pe0(2.0) → wire → hbm_ctrl.slice0
t=2.025 yield req → immediate (first to arrive)
@@ -343,7 +343,7 @@ compare `actual_ns` vs `formula_ns` (available in PE DMA traces).
## Probe Output Explained
```
```text
=== PE DMA Latency ===
Case Target Actual Ovhd Drain Wire Ovhd% Drain% Eff.BW BN.BW Util%
pe-local-hbm c0.pe0->c0.slice0 18.09 2.0 16.0 0.08 11.1% 88.5% 226.49 256.0 88.5%
@@ -368,7 +368,7 @@ pe-cross-half-hbm c0.pe0->c0.slice4 37.14 5.0 32.0 0.14 13.5% 86.1%
fraction. For small transfers (4KB), overhead is significant relative to drain.
For large transfers, drain dominates and utilization approaches 100%.
```
```text
4 KB: Ovhd=2.0, Drain=16.0 → Util=88.5% (overhead is 11% of time)
64 KB: Ovhd=2.0, Drain=256.0 → Util=99.2% (overhead is <1% of time)
```
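
The utilization figures above can be cross-checked by hand. The following is a minimal sketch, not the probe's actual code: the function names are hypothetical, the 256 GB/s bottleneck is taken from the `BN.BW` column, and the 0.08 ns wire delay is taken from the `Wire` column of the `pe-local-hbm` row and assumed to be the same for the 64 KB case.

```python
# Hypothetical helpers that recompute the probe columns from the documented formulas.
# Assumption: 1 GB/s == 1 byte/ns, so bytes / (GB/s) yields nanoseconds.

def drain_ns(nbytes: int, bottleneck_bw_gbs: float) -> float:
    """Serialization delay: drain_ns = nbytes / bottleneck_bw_gbs."""
    return nbytes / bottleneck_bw_gbs


def formula_ns(wire_ns: float, overhead_ns: float, drain: float) -> float:
    """Contention-free lower bound: wire propagation + component overhead + drain."""
    return wire_ns + overhead_ns + drain


def utilization_pct(nbytes: int, bw_gbs: float, overhead_ns: float, wire_ns: float) -> float:
    """Share of total time spent draining data, as a percentage."""
    drain = drain_ns(nbytes, bw_gbs)
    return 100.0 * drain / formula_ns(wire_ns, overhead_ns, drain)


BW_GBS = 256.0      # BN.BW column of the pe-local-hbm row
OVERHEAD_NS = 2.0   # Ovhd column
WIRE_NS = 0.08      # Wire column (rounded in the table; assumed size-independent)

for label, nbytes in (("4 KB", 4096), ("64 KB", 65536)):
    util = utilization_pct(nbytes, BW_GBS, OVERHEAD_NS, WIRE_NS)
    print(f"{label}: drain={drain_ns(nbytes, BW_GBS):.1f} ns, util={util:.1f}%")
```

With no contention, `formula_ns` equals the measured `actual_ns`, so the same helper also gives the baseline for `queueing_delay = actual_ns - formula_ns`.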