commit - release 1

2026-03-18 11:47:48 -07:00
commit 6f43807900
109 changed files with 14909 additions and 0 deletions
@@ -0,0 +1,178 @@
+# ADR-0015: Component Port/Wire Model and Fabric Routing
+
+## Status
+
+Proposed
+
+## Context
+
+ADR-0007 D2 assigns path-walking and low-level request decomposition to the simulation engine.
+In practice, the engine iterates the topology path and calls `run()` on each component
+sequentially — conflating routing policy with component behavior and preventing realistic
+hardware modeling (queues, contention, fan-out).
+
+ADR-0007 D3 already states that components own fan-out and aggregation, but the current
+implementation does not enforce this for fabric traversal.
+
+This ADR defines:
+
+- how components communicate via typed port queues,
+- how propagation delay is modeled (wire processes),
+- the fabric path for Memory R/W through M_CPU.DMA,
+- the reduced role of the simulation engine,
+- M_CPU.DMA as an internal subcomponent of M_CPU.
+
+---
+
+## Decision
+
+### D1. Component port model
+
+Each component has typed input/output ports modeled as SimPy Stores:
+
+```
+in_ports:  dict[str, simpy.Store]   # keyed by source node_id
+out_ports: dict[str, simpy.Store]   # keyed by destination node_id
+```
+
+Ports are created at engine initialization based on graph edges.
+Each directed edge (src → dst) results in:
+
+- `src.out_ports[dst]`  — the sending end
+- `dst.in_ports[src]`   — the receiving end
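A minimal sketch of the port construction described above, assuming the compiled topology is available as a list of `(src, dst)` node-id pairs. `Component`, `build_ports`, and the `make_store` factory are hypothetical names. In the engine the factory would presumably be `lambda: simpy.Store(env)`; the example substitutes `collections.deque` so it runs without a simulation environment.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical stand-in for a topology node with per-neighbor ports."""
    node_id: str
    in_ports: dict = field(default_factory=dict)   # keyed by source node_id
    out_ports: dict = field(default_factory=dict)  # keyed by destination node_id


def build_ports(edges, make_store):
    """Create one out_port and one in_port per directed edge (src -> dst)."""
    components = {}
    for src, dst in edges:
        for node_id in (src, dst):
            components.setdefault(node_id, Component(node_id))
        components[src].out_ports[dst] = make_store()  # sending end
        components[dst].in_ports[src] = make_store()   # receiving end
    return components


# Illustrative edge list; node names are assumptions, not the real topology.
edges = [("ucie_in", "noc"), ("noc", "m_cpu"), ("m_cpu", "noc")]
components = build_ports(edges, make_store=deque)
```

The two ends of an edge are separate queues on purpose: the wire process of D2 is what moves a command from the sending end to the receiving end.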
+
+---
+
+### D2. Wire process (propagation delay)
+
+For each directed edge (src, dst) in the topology graph, a SimPy wire process
+models propagation delay:
+
+```python
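+def wire_process(env, out_port, in_port, delay_ns):
+    """Move commands from one edge's out_port to its in_port after delay_ns."""
+    while True:
+        cmd = yield out_port.get()   # wait for the sender to enqueue a command
+        yield env.timeout(delay_ns)  # model propagation delay on this edge
+        yield in_port.put(cmd)       # hand the command to the receiving component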
+```
+
+Wire processes are started at engine initialization.
+Bandwidth (BW) constraints are enforced by the sending component's `out_port` capacity or
+token model, not by the wire process itself.
+
+---
+
+### D3. Engine role (reduced)
+
+The simulation engine MUST:
+
+- wire components at initialization (create port Stores, start wire processes),
+- identify the entry component for each request type (PCIE_EP),
+- put the request into the entry component's in_port,
+- wait for a completion event.
+
+The simulation engine MUST NOT:
+
+- walk the topology path during request execution,
+- call component `run()` methods directly,
+- track per-hop latency or decompose fan-out.
+
+This supersedes the path-walking and "decompose operations into low-level requests" clauses of ADR-0007 D2.
+ADR-0007 D2 must be amended accordingly.
+
+---
+
+### D4. Unified fabric path for Memory R/W and Kernel Launch
+
+Both Memory R/W and Kernel Launch use the same fabric path to reach the target cube's M_CPU.
+The difference is what M_CPU does upon receiving the request.
+
+**Forward path (IO_CPU → target M_CPU):**
+
+```
+IO_CPU
+  → [transit cubes: ucie_out → wire → ucie_in → noc → ucie_out]  (zero or more)
+  → target cube: ucie_in → noc → M_CPU
+```
+
+**At M_CPU (diverges by operation type):**
+
+```
+Memory R/W:     M_CPU → M_CPU.DMA → noc → hbm_ctrl
+Kernel Launch:  M_CPU → PE[0..n] (parallel fan-out)
+```
+
+**Completion path (reverse, same fabric):**
+
+```
+Memory R/W:     hbm_ctrl → noc → M_CPU.DMA → M_CPU
+Kernel Launch:  PE[0..n] all complete → M_CPU (aggregation)
+
+M_CPU → [transit cubes: ucie_in → noc → ucie_out] → IO_CPU → runtime_api
+```
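As a rough illustration of the paths above, the following sketch expands the forward path into a hop list and selects what M_CPU drives for each operation type. The `cube.component` naming, the function names, and the operation strings are hypothetical. Each transit cube is expanded as `ucie_in → noc → ucie_out`, following D6.

```python
def forward_hops(transit_cubes, target_cube):
    """Hop list from IO_CPU to the target cube's M_CPU."""
    hops = ["IO_CPU"]
    for cube in transit_cubes:  # zero or more transit cubes
        hops += [f"{cube}.ucie_in", f"{cube}.noc", f"{cube}.ucie_out"]
    hops += [f"{target_cube}.ucie_in", f"{target_cube}.noc", f"{target_cube}.M_CPU"]
    return hops


def m_cpu_targets(op, target_cube, num_pes):
    """Components M_CPU drives after receiving the request, by operation type."""
    if op == "memory_rw":
        return [f"{target_cube}.M_CPU.DMA"]
    if op == "kernel_launch":
        return [f"{target_cube}.PE[{i}]" for i in range(num_pes)]  # parallel fan-out
    raise ValueError(f"unknown operation type: {op}")


path = forward_hops(["cube0"], "cube1")
targets = m_cpu_targets("kernel_launch", "cube1", num_pes=2)
```

In the port/wire model no component would hold such a list. Each hop only knows its own ports, so the function is useful mainly for tests and documentation.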
+
+---
+
+### D5. M_CPU.DMA is an internal subcomponent of M_CPU
+
+M_CPU.DMA is NOT a separate topology node.
+It is an internal subcomponent owned by the M_CPU component implementation.
+
+M_CPU.DMA:
+
+- owns the DMA READ and DMA WRITE queues (capacity=1 each, per ADR-0014 D4),
+- issues memory requests over the NOC to hbm_ctrl,
+- receives completion from hbm_ctrl via the NOC,
+- reports completion to M_CPU,
+- is created and managed inside M_CPU's `__init__` and `run()`.
+
+M_CPU.DMA does not appear as a node in the compiled topology graph.
+
+---
+
+### D6. Transit cube forwarding
+
+A cube that is not the target of a memory or kernel request acts as a transit node.
+Transit cubes forward requests without consuming them:
+
+```
+ucie_in (from upstream) → noc → ucie_out (to downstream)
+```
+
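+The transit routing decision (recognizing that this cube is not the target and steering the
+packet toward the downstream ucie_out) is implemented entirely within the ucie_in component.
+The noc and ucie_out components in a transit cube forward the packet without modification.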
+
+---
+
+### D7. _formula_latency is preserved as a lower-bound cross-check
+
+The path-based formula latency function (`_formula_latency`) is preserved in the engine
+as a lower bound for correctness verification.
+
+Invariant:
+
+- Phase 0: `_formula_latency == component model total_ns`
+- Phase 1+: `_formula_latency <= component model total_ns` (contention adds queueing)
+
+This function is independent of the port/wire model and requires only the topology graph.
+It is used for shard comparison in `_route_kernel` and as a regression guard.
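The invariant can be expressed as a small check. The sketch below assumes the formula latency is the sum of per-edge delays along a path. The real `_formula_latency` may include other terms, and `formula_latency_ns`, `invariant_holds`, and the delay values are hypothetical.

```python
def formula_latency_ns(path, edge_delay_ns):
    """Sum the per-edge delays along a path of node ids (assumed formula)."""
    return sum(edge_delay_ns[(src, dst)] for src, dst in zip(path, path[1:]))


def invariant_holds(formula_ns, model_total_ns, phase):
    """Phase 0 requires equality; later phases allow extra queueing delay."""
    if phase == 0:
        return formula_ns == model_total_ns
    return formula_ns <= model_total_ns


# Illustrative delays in nanoseconds; not taken from any real topology.
edge_delay_ns = {("io_cpu", "ucie_in"): 10, ("ucie_in", "noc"): 2, ("noc", "m_cpu"): 3}
path = ["io_cpu", "ucie_in", "noc", "m_cpu"]
lower_bound = formula_latency_ns(path, edge_delay_ns)
```

A regression test could call `invariant_holds(lower_bound, total_ns, phase)` on each simulated request and fail as soon as the component model reports a total below the formula value.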
+
+---
+
+## Consequences
+
+- Components model realistic hardware behavior (queues, contention, fan-out).
+- Propagation delay is modeled per directed edge by a dedicated wire process (D2).
+- Engine is decoupled from routing policy.
+- Component implementations remain swappable via DI (ADR-0007 D3).
+- ADR-0007 D2 must be amended to remove path-walking from engine responsibilities.
+- ADR-0009 D3 should be updated to reference the unified fabric path (D4 above).
+
+---
+
+## Links
+
+- ADR-0007 D2 (to be amended: engine path-walking clause)
+- ADR-0009 D3 (kernel execution fan-out; fabric path to be referenced)
+- ADR-0014 D4 (DMA engine capacity=1)
+- ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)