Add CHANGES.md, README, update SPEC/ADRs for release 2

- CHANGES.md: detailed changelog for release 1 and 2
- README.md: full project docs with install, probe, run, test usage
- SPEC.md: add ADR-0014~0017 references, update R7 for pcie_ep endpoint
- ADR-0003: update NOC description to reference ADR-0017
- ADR-0004: add HBM efficiency factor (0.8) to BW guarantee contract
- ADR-0014: status Proposed -> Accepted
- ADR-0015: update D4 to M_CPU bypass for Memory R/W, add ADR-0016/0017 links
- ADR-0016 (new): IOChiplet NOC and memory data path
- ADR-0017 (new): Cube NOC 2D mesh architecture
- Fix MD lint warnings (unfenced code blocks) across all docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 01:43:15 -07:00
parent d75da439c6
commit fc6abbc8ee
10 changed files with 613 additions and 65 deletions
@@ -7,7 +7,7 @@ Every request flows through a graph of **components** connected by **wires**.
 The total latency reported is the **actual SimPy wall-clock** (`env.now` delta),
 not a static formula—so contention and queueing are captured automatically.

-```
+```text
 total_ns (actual) = wire_prop + component_overhead + drain + queueing
                    ├── deterministic ──────────────────┘       │
                    └── contention-dependent ────────────────────┘
@@ -17,7 +17,7 @@ total_ns (actual) = wire_prop + component_overhead + drain + queueing

 ### 1. Wire Propagation

-```
+```text
 wire_ns = distance_mm × ns_per_mm       (global: 0.01 = 10 ps/mm)
 ```

@@ -29,7 +29,7 @@ and negligible compared to other costs.

 ### 2. Component Overhead (`overhead_ns`)

-```
+```text
 component_ns = node.attrs["overhead_ns"]
 ```

@@ -53,7 +53,7 @@ This models arbitration, protocol processing, pipeline stages, etc.

 ### 3. Drain (Serialization Delay)

-```
+```text
 drain_ns = nbytes / bottleneck_bw_gbs
 ```

@@ -65,7 +65,7 @@ Example: 4096 bytes through a path with bottleneck 128 GB/s → `4096 / 128 = 32

 ### Formula (Theoretical Lower Bound)

-```
+```text
 formula_ns = Σ(wire_prop) + Σ(overhead_ns) + drain_ns
 ```

@@ -159,7 +159,7 @@ a timeout or waits on a resource/store. The delta between start and done capture

 Each component is a SimPy process:

-```
+```text
 _fan_in (per in_port)  →  _inbox (Store)  →  _worker  →  out_ports
 ```

@@ -215,7 +215,7 @@ If request A holds the resource and request B arrives:
 - SimPy advances B's `env.now` by A's remaining service time
 - This "extra" time shows up in B's `total_ns` automatically

-```
+```text
 No contention:  actual_ns == formula_ns
 Contention:     actual_ns  > formula_ns
                queueing_delay = actual_ns - formula_ns
@@ -237,7 +237,7 @@ with self._resource.request() as req:
 This means a short request arriving during a long request's drain must wait
 for the full remaining drain time—classic head-of-line blocking:

-```
+```text
 Request A: 4 KB,  drain = 16.0 ns   (arrives at t=0)
 Request B: 64 B,  drain = 0.25 ns   (arrives at t=5)

@@ -274,7 +274,7 @@ Setup: PE0 and PE1 in cube0 both read 4096 bytes from their local HBM slices

 ### Paths

-```
+```text
 DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
 DMA B: pe1.pe_dma → xbar.pe1 → hbm_ctrl.slice1
 ```
@@ -284,7 +284,7 @@ DMA B: pe1.pe_dma → xbar.pe1 → hbm_ctrl.slice1
 Since slice0 and slice1 are **separate** hbm_ctrl instances, each with its own
 `simpy.Resource(capacity=1)`, there is no resource competition.

-```
+```text
 DMA A timeline:
  t=0.00   pe_dma dequeues txn
  t=0.00   xbar.pe0: overhead_ns=2.0 → t=2.00
@@ -304,13 +304,13 @@ Both complete at ~18.09 ns. `actual == formula` for both.

 Now suppose both PE0 and PE1 read from **slice0**:

-```
+```text
 DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
 DMA B: pe1.pe_dma → xbar.pe1 → xbar.pe0 → hbm_ctrl.slice0
                                (chain traversal to reach slice0)
 ```

-```
+```text
 DMA A timeline:
  t=0.00   xbar.pe0(2.0) → wire → hbm_ctrl.slice0
  t=2.025  yield req → immediate (first to arrive)
@@ -343,7 +343,7 @@ compare `actual_ns` vs `formula_ns` (available in PE DMA traces).

 ## Probe Output Explained

-```
+```text
 === PE DMA Latency ===
 Case                Target              Actual  Ovhd  Drain  Wire  Ovhd% Drain%  Eff.BW   BN.BW   Util%
 pe-local-hbm        c0.pe0->c0.slice0    18.09   2.0  16.0  0.08  11.1% 88.5%   226.49   256.0   88.5%
@@ -368,7 +368,7 @@ pe-cross-half-hbm   c0.pe0->c0.slice4    37.14   5.0  32.0  0.14  13.5% 86.1%
 fraction. For small transfers (4KB), overhead is significant relative to drain.
 For large transfers, drain dominates and utilization approaches 100%.

-```
+```text
  4 KB:  Ovhd=2.0, Drain=16.0  → Util=88.5%   (overhead is 11% of time)
 64 KB:  Ovhd=2.0, Drain=256.0 → Util=99.2%   (overhead is <1% of time)
 ```