Add CHANGES.md, README, update SPEC/ADRs for release 2

- CHANGES.md: detailed changelog for release 1 and 2
- README.md: full project docs with install, probe, run, test usage
- SPEC.md: add ADR-0014~0017 references, update R7 for pcie_ep endpoint
- ADR-0003: update NOC description to reference ADR-0017
- ADR-0004: add HBM efficiency factor (0.8) to BW guarantee contract
- ADR-0014: status Proposed -> Accepted
- ADR-0015: update D4 to M_CPU bypass for Memory R/W, add ADR-0016/0017 links
- ADR-0016 (new): IOChiplet NOC and memory data path
- ADR-0017 (new): Cube NOC 2D mesh architecture
- Fix MD lint warnings (unfenced code blocks) across all docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@@ -17,8 +17,8 @@ implementation does not enforce this for fabric traversal.
 This ADR defines:

 - how components communicate via typed port queues,
-- how propagation delay is modeled (wire processes),
-- the fabric path for Memory R/W through M_CPU.DMA,
+- how propagation delay is modeled (wire processes with BW occupancy),
+- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch (via M_CPU),
 - the reduced role of the simulation engine,
 - M_CPU.DMA as an internal subcomponent of M_CPU.

@@ -30,7 +30,7 @@ This ADR defines:

 Each component has typed input/output ports modeled as SimPy Stores:

-```
+```text
 in_ports: dict[str, simpy.Store] # keyed by source node_id
 out_ports: dict[str, simpy.Store] # keyed by destination node_id
 ```
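
The port layout in the hunk above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's code: `Component`, `connect`, and `wire_transfer` are hypothetical names, `collections.deque` stands in for `simpy.Store` so the snippet runs without SimPy, and the wire step moves a message with no modeled delay or bandwidth occupancy.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical component holding one queue per neighbouring node."""
    node_id: str
    # deque is a stand-in for simpy.Store (assumption, see lead-in).
    in_ports: dict[str, deque] = field(default_factory=dict)   # keyed by source node_id
    out_ports: dict[str, deque] = field(default_factory=dict)  # keyed by destination node_id


def connect(src: Component, dst: Component) -> None:
    """Create the port pair for one directed link src -> dst."""
    src.out_ports[dst.node_id] = deque()
    dst.in_ports[src.node_id] = deque()


def wire_transfer(src: Component, dst: Component) -> None:
    """Move one message across the link; a real wire process would add delay here."""
    msg = src.out_ports[dst.node_id].popleft()
    dst.in_ports[src.node_id].append(msg)


noc = Component("noc")
hbm_ctrl = Component("hbm_ctrl")
connect(noc, hbm_ctrl)

# The sender only touches its own out port; the receiver only reads its in port.
noc.out_ports["hbm_ctrl"].append({"op": "mem_write", "bytes": 64})
wire_transfer(noc, hbm_ctrl)
received = hbm_ctrl.in_ports["noc"].popleft()
```

Keying each port by the peer's `node_id` means a component never needs a reference to its neighbour, only to the queue that leads there.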
@@ -93,35 +93,51 @@ ADR-0007 D2 must be amended accordingly.

 ---

-### D4. Unified fabric path for Memory R/W and Kernel Launch
+### D4. Fabric paths for Memory R/W and Kernel Launch

-Both Memory R/W and Kernel Launch use the same fabric path to reach the target cube's M_CPU.
-The difference is what M_CPU does upon receiving the request.
+Memory R/W and Kernel Launch use **different** fabric paths.
+Memory operations bypass M_CPU and route directly to HBM via the crossbar.
+Kernel Launch routes through M_CPU for PE fan-out.

-**Forward path (IO_CPU → target M_CPU):**
+**Memory R/W forward path (pcie_ep → hbm_ctrl, M_CPU bypass):**

-```
-IO_CPU
-→ [transit cubes: ucie_out → wire → ucie_in → noc → ucie_out] (zero or more)
-→ target cube: ucie_in → noc → M_CPU
+```text
+pcie_ep → io_noc → io_ucie
+→ [transit cubes: ucie_in → noc → ucie_out] (zero or more)
+→ target cube: ucie_in → noc → xbar → hbm_ctrl
 ```

-**At M_CPU (diverges by operation type):**
+**Memory R/W completion path:**

-```
-Memory R/W: M_CPU → M_CPU.DMA → noc → hbm_ctrl
-Kernel Launch: M_CPU → PE[0..n] (parallel fan-out)
+```text
+hbm_ctrl → xbar → noc → [transit cubes: ucie → noc → ucie]
+→ io_ucie → io_noc → pcie_ep
 ```

-**Completion path (reverse, same fabric):**
+**Kernel Launch forward path (pcie_ep → io_cpu → M_CPU → PE):**

+```text
+pcie_ep → io_noc → io_cpu → io_noc → io_ucie
+→ [transit cubes: ucie_in → noc → ucie_out] (zero or more)
+→ target cube: ucie_in → noc → M_CPU → PE[0..n] (parallel fan-out)
 ```
-Memory R/W: hbm_ctrl → noc → M_CPU.DMA → M_CPU
-Kernel Launch: PE[0..n] all complete → M_CPU (aggregation)

-M_CPU → [transit cubes: ucie → noc → ucie] → IO_CPU → runtime_api
+**Kernel Launch completion path:**
+
+```text
+PE[0..n] all complete → M_CPU (aggregation)
+→ noc → [transit cubes: ucie → noc → ucie]
+→ io_ucie → io_noc → io_cpu → io_noc → pcie_ep
 ```

+**Rationale for M_CPU bypass on Memory R/W:**
+
+Memory write/read operations do not require command interpretation or PE
+dispatch — they are direct data transfers to/from HBM. Routing through M_CPU
+would add unnecessary overhead (5ns) without functional benefit. The io_noc
+inside the IO chiplet handles the routing decision: memory operations go
+directly to cube fabric, while kernel launches are forwarded to io_cpu first.
+
 ---

 ### D5. M_CPU.DMA is an internal subcomponent of M_CPU
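
The two forward paths introduced by the revised D4 can be written out as hop lists. The sketch below is illustrative only: `forward_path`, `transit_hops`, and the operation names `"memory_rw"` and `"kernel_launch"` are hypothetical, and the hop labels are copied from the path listings in the hunk above rather than from the simulator's topology graph.

```python
def transit_hops(transit_cubes: int) -> list[str]:
    """Hops contributed by zero or more transit cubes (ucie_in -> noc -> ucie_out each)."""
    hops: list[str] = []
    for _ in range(transit_cubes):
        hops += ["ucie_in", "noc", "ucie_out"]
    return hops


def forward_path(op: str, transit_cubes: int = 0) -> list[str]:
    """Return the forward hop list for one request type (names are assumptions)."""
    if op == "memory_rw":
        # io_noc sends memory traffic straight to the cube fabric; M_CPU is bypassed.
        head = ["pcie_ep", "io_noc", "io_ucie"]
        tail = ["ucie_in", "noc", "xbar", "hbm_ctrl"]
    elif op == "kernel_launch":
        # io_noc forwards kernel launches to io_cpu first, then on to the target M_CPU.
        head = ["pcie_ep", "io_noc", "io_cpu", "io_noc", "io_ucie"]
        tail = ["ucie_in", "noc", "M_CPU", "PE[0..n]"]
    else:
        raise ValueError(f"unknown operation: {op}")
    return head + transit_hops(transit_cubes) + tail


mem_path = forward_path("memory_rw", transit_cubes=1)
launch_path = forward_path("kernel_launch")
```

With these lists, the bypass is easy to check: `"M_CPU"` never appears in a memory path, while every kernel-launch path passes through both `io_cpu` and `M_CPU`.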
@@ -146,7 +162,7 @@ M_CPU.DMA does not appear as a node in the compiled topology graph.
 A cube that is not the target of a memory or kernel request acts as a transit node.
 Transit cubes forward requests without consuming them:

-```
+```text
 ucie_in (from upstream) → noc → ucie_out (to downstream)
 ```

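
The transit rule combines with the D4 paths into a single per-cube decision at the NOC. The function below is a sketch under assumptions: `noc_next_hop`, its parameters, and the operation names are hypothetical, and it only captures which component the request is handed to next, not timing or bandwidth.

```python
def noc_next_hop(cube_id: str, target_cube: str, op: str) -> str:
    """Pick the next component for a request that arrived on this cube's ucie_in."""
    if cube_id != target_cube:
        # Transit cube: forward without consuming the request.
        return "ucie_out"
    if op == "memory_rw":
        # Target cube, memory operation: go to HBM via the crossbar, bypassing M_CPU.
        return "xbar"
    if op == "kernel_launch":
        # Target cube, kernel launch: M_CPU performs the PE fan-out.
        return "M_CPU"
    raise ValueError(f"unknown operation: {op}")


hop_transit = noc_next_hop("cube0", "cube2", "memory_rw")
hop_memory = noc_next_hop("cube2", "cube2", "memory_rw")
hop_launch = noc_next_hop("cube2", "cube2", "kernel_launch")
```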
@@ -187,3 +203,5 @@ It is used for shard comparison in `_route_kernel` and as a regression guard.
 - ADR-0009 D3 (kernel execution fan-out; fabric path to be referenced)
 - ADR-0014 D4 (DMA engine capacity=1)
 - ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)
+- ADR-0016 (IOChiplet NOC and memory data path)
+- ADR-0017 (cube NOC 2D mesh architecture)