# ADR-0015: Component Port/Wire Model and Fabric Routing

## Status

Accepted

## Context

ADR-0007 D2 assigns path-walking and low-level request decomposition to the simulation engine. In practice, the engine iterates the topology path and calls `run()` on each component sequentially, which conflates routing policy with component behavior and prevents realistic hardware modeling (queues, contention, fan-out).

ADR-0007 D3 already states that components own fan-out and aggregation, but the current implementation does not enforce this for fabric traversal.

This ADR defines:

- how components communicate via typed port queues,
- how propagation delay is modeled (wire processes with BW occupancy),
- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch (via M_CPU),
- the reduced role of the simulation engine,
- M_CPU.DMA as an internal subcomponent of M_CPU.

---

## Decision

### D1. Component port model

Each component has typed input/output ports modeled as SimPy Stores:

```text
in_ports:  dict[str, simpy.Store]   # keyed by source node_id
out_ports: dict[str, simpy.Store]   # keyed by destination node_id
```

Ports are created at engine initialization based on graph edges. Each directed edge (src → dst) results in:

- `src.out_ports[dst]`: the sending end
- `dst.in_ports[src]`: the receiving end

---

### D2. Wire process (propagation delay + BW occupancy)

For each directed edge (src, dst) in the topology graph, a SimPy wire process models propagation delay and BW occupancy:

```python
def wire_process(env, out_port, in_port, delay_ns, bw_gbs):
    # Time at which this directed link is free of the previous transfer.
    available_at = 0.0
    while True:
        cmd = yield out_port.get()
        if bw_gbs > 0:
            # Commands without an nbytes attribute do not occupy the link.
            nbytes = getattr(cmd, "nbytes", 0)
            if nbytes > 0:
                # Wait until the previous transfer has cleared the link.
                wait = available_at - env.now
                if wait > 0:
                    yield env.timeout(wait)
                # bytes / (GB/s) yields ns, taking 1 GB/s = 1 byte/ns.
                available_at = env.now + (nbytes / bw_gbs)
        # Propagation delay, then hand the command to the receiving port.
        yield env.timeout(delay_ns)
        yield in_port.put(cmd)
```

Wire processes are started at engine initialization.

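
The timing produced by `wire_process` can be traced by hand without SimPy. The sketch below is an illustration only: `trace_link` is a hypothetical helper, not part of the engine, and it assumes the receiving Store is unbounded (so `in_port.put` never blocks) and that simulation time is in nanoseconds with 1 GB/s taken as 1 byte/ns.

```python
def trace_link(commands, delay_ns, bw_gbs):
    """Return delivery times (ns) for commands sent over one directed link.

    commands: list of (enqueue_ns, nbytes) tuples, sorted by enqueue time.
    Hypothetical helper that mirrors the arithmetic of wire_process.
    """
    available_at = 0.0  # when the link is free of the previous transfer
    now = 0.0           # when the wire process can take its next command
    delivered = []
    for enqueue_ns, nbytes in commands:
        # out_port.get() returns once a command is present.
        now = max(now, enqueue_ns)
        if bw_gbs > 0 and nbytes > 0:
            # Back-to-back serialization: wait for the link to clear.
            now = max(now, available_at)
            available_at = now + nbytes / bw_gbs
        # Propagation delay before the command reaches the in_port.
        now += delay_ns
        delivered.append(now)
    return delivered


# Three 640-byte commands enqueued at t=0 on a 32 GB/s link with 5 ns delay.
times = trace_link([(0.0, 640), (0.0, 640), (0.0, 640)], delay_ns=5.0, bw_gbs=32.0)
print(times)  # [5.0, 25.0, 45.0]
```

Each 640-byte command occupies the link for 20 ns, so deliveries are spaced 20 ns apart. As the code is written, the first command is delivered after the propagation delay alone; occupancy only delays the commands that follow it.
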
Each directed edge maintains an `available_at` timestamp tracking when the link becomes free for the next transaction. When a transaction occupies a link, the next transaction on the same directed link must wait until occupancy clears (back-to-back serialization). TX and RX directions are independent (separate wire processes with separate `available_at` state).

---

### D3. Engine role (reduced)

The simulation engine MUST:

- wire components at initialization (create port Stores, start wire processes),
- identify the entry component for each request type (PCIE_EP),
- put the request into the entry component's in_port,
- wait for a completion event.

The simulation engine MUST NOT:

- walk the topology path during request execution,
- call component `run()` methods directly,
- track per-hop latency or decompose fan-out.

This supersedes ADR-0007 D2's "decompose operations into low-level requests" clause. ADR-0007 D2 must be amended accordingly.

---

### D4. Fabric paths for Memory R/W and Kernel Launch

Memory R/W and Kernel Launch use **different** fabric paths. Memory operations bypass M_CPU and route directly to HBM via the crossbar. Kernel Launch routes through M_CPU for PE fan-out.

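
The two forward routes specified in this section can be expanded into explicit hop lists for a given number of transit cubes. The following is a minimal sketch under stated assumptions: the function names and the `cubeN.` node-naming scheme are hypothetical, cubes are assumed to be chained linearly, and the final parallel PE fan-out is left out because it is not a single hop.

```python
def _transit_hops(n_transit):
    """Hops through n_transit transit cubes (hypothetical cubeN naming)."""
    hops = []
    for i in range(n_transit):
        hops += [f"cube{i}.ucie_in", f"cube{i}.noc", f"cube{i}.ucie_out"]
    return hops


def memory_rw_forward_path(n_transit):
    """Memory R/W: bypasses M_CPU and ends at hbm_ctrl via the crossbar."""
    target = f"cube{n_transit}"
    return (
        ["pcie_ep", "io_noc", "io_ucie"]
        + _transit_hops(n_transit)
        + [f"{target}.ucie_in", f"{target}.noc", f"{target}.xbar", f"{target}.hbm_ctrl"]
    )


def kernel_launch_forward_path(n_transit):
    """Kernel Launch: visits io_cpu first and ends at M_CPU (before PE fan-out)."""
    target = f"cube{n_transit}"
    return (
        ["pcie_ep", "io_noc", "io_cpu", "io_noc", "io_ucie"]
        + _transit_hops(n_transit)
        + [f"{target}.ucie_in", f"{target}.noc", f"{target}.m_cpu"]
    )


mem_path = memory_rw_forward_path(1)
launch_path = kernel_launch_forward_path(1)
print(" -> ".join(mem_path))
print(" -> ".join(launch_path))
```

Under the port/wire model the engine does not walk such lists at run time; a list like this is only useful for inspection or for a path-based cross-check.
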
**Memory R/W forward path (pcie_ep → hbm_ctrl, M_CPU bypass):**

```text
pcie_ep → io_noc → io_ucie
  → [transit cubes: ucie_in → noc → ucie_out] (zero or more)
  → target cube: ucie_in → noc → xbar → hbm_ctrl
```

**Memory R/W completion path:**

```text
hbm_ctrl → xbar → noc
  → [transit cubes: ucie → noc → ucie]
  → io_ucie → io_noc → pcie_ep
```

**Kernel Launch forward path (pcie_ep → io_cpu → M_CPU → PE):**

```text
pcie_ep → io_noc → io_cpu → io_noc → io_ucie
  → [transit cubes: ucie_in → noc → ucie_out] (zero or more)
  → target cube: ucie_in → noc → M_CPU → PE[0..n] (parallel fan-out)
```

**Kernel Launch completion path:**

```text
PE[0..n] all complete → M_CPU (aggregation)
  → noc → [transit cubes: ucie → noc → ucie]
  → io_ucie → io_noc → io_cpu → io_noc → pcie_ep
```

**Rationale for M_CPU bypass on Memory R/W:**

Memory write/read operations do not require command interpretation or PE dispatch; they are direct data transfers to/from HBM. Routing through M_CPU would add unnecessary overhead (5 ns) without functional benefit. The io_noc inside the IO chiplet handles the routing decision: memory operations go directly to the cube fabric, while kernel launches are forwarded to io_cpu first.

---

### D5. M_CPU.DMA is an internal subcomponent of M_CPU

M_CPU.DMA is NOT a separate topology node. It is an internal subcomponent owned by the M_CPU component implementation.

M_CPU.DMA:

- owns the DMA READ and DMA WRITE queues (capacity=1 each, per ADR-0014 D4),
- issues memory requests over the NOC to hbm_ctrl,
- receives completion from hbm_ctrl via the NOC,
- reports completion to M_CPU,
- is created and managed inside M_CPU's `__init__` and `run()`.

M_CPU.DMA does not appear as a node in the compiled topology graph.

---

### D6. Transit cube forwarding

A cube that is not the target of a memory or kernel request acts as a transit node.
Transit cubes forward requests without consuming them:

```text
ucie_in (from upstream) → noc → ucie_out (to downstream)
```

Transit forwarding is implemented entirely within the ucie_in component. The noc and ucie_out components in a transit cube forward the packet without modification.

---

### D7. `_formula_latency` is preserved as a lower-bound cross-check

The path-based formula latency function (`_formula_latency`) is preserved in the engine as a lower bound for correctness verification.

Invariant:

- Phase 0: `_formula_latency == component model total_ns`
- Phase 1+: `_formula_latency <= component model total_ns` (contention adds queueing)

This function is independent of the port/wire model and requires only the topology graph. It is used for shard comparison in `_route_kernel` and as a regression guard.

---

## Consequences

- Components model realistic hardware behavior (queues, contention, fan-out).
- Propagation delay and BW occupancy are modeled per directed edge.
- Engine is decoupled from routing policy.
- Component implementations remain swappable via DI (ADR-0007 D3).
- ADR-0007 D2 must be amended to remove path-walking from engine responsibilities.
- ADR-0009 D3 should be updated to reference the fabric paths defined in D4 above.

---

## Links

- ADR-0007 D2 (to be amended: engine path-walking clause)
- ADR-0009 D3 (kernel execution fan-out; fabric path to be referenced)
- ADR-0014 D4 (DMA engine capacity=1)
- ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)
- ADR-0016 (IOChiplet NOC and memory data path)
- ADR-0017 (cube NOC 2D mesh architecture)