kernbench2/docs/adr/ADR-0015-component-port-wire-model.md
2026-03-18 11:47:48 -07:00


ADR-0015: Component Port/Wire Model and Fabric Routing

Status

Proposed

Context

ADR-0007 D2 assigns path-walking and low-level request decomposition to the simulation engine. In practice, the engine iterates over the topology path and calls run() on each component sequentially. This conflates routing policy with component behavior and prevents realistic hardware modeling (queues, contention, fan-out).

ADR-0007 D3 already states that components own fan-out and aggregation, but the current implementation does not enforce this for fabric traversal.

This ADR defines:

  • how components communicate via typed port queues,
  • how propagation delay is modeled (wire processes),
  • the fabric path for Memory R/W through M_CPU.DMA,
  • the reduced role of the simulation engine,
  • M_CPU.DMA as an internal subcomponent of M_CPU.

Decision

D1. Component port model

Each component has typed input/output ports modeled as SimPy Stores:

in_ports:  dict[str, simpy.Store]   # keyed by source node_id
out_ports: dict[str, simpy.Store]   # keyed by destination node_id

Ports are created at engine initialization based on graph edges. Each directed edge (src → dst) results in:

  • src.out_ports[dst] — the sending end
  • dst.in_ports[src] — the receiving end

D2. Wire process (propagation delay)

For each directed edge (src, dst) in the topology graph, a SimPy wire process models propagation delay:

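def wire_process(env, out_port, in_port, delay_ns):
    """Move commands from src.out_ports[dst] to dst.in_ports[src] after delay_ns."""
    while True:
        cmd = yield out_port.get()      # wait for the sender to emit a command
        yield env.timeout(delay_ns)     # model propagation delay on this edge
        yield in_port.put(cmd)          # deliver to the receiver's input port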

Wire processes are started at engine initialization. Bandwidth (BW) constraints are enforced by the sending component's out_port capacity or token model, not by the wire process itself.


D3. Engine role (reduced)

The simulation engine MUST:

  • wire components at initialization (create port Stores, start wire processes),
  • identify the entry component for each request type (PCIE_EP),
  • put the request into the entry component's in_port,
  • wait for a completion event.

The simulation engine MUST NOT:

  • walk the topology path during request execution,
  • call component run() methods directly,
  • track per-hop latency or decompose fan-out.

This supersedes ADR-0007 D2's "decompose operations into low-level requests" clause. ADR-0007 D2 must be amended accordingly.


D4. Unified fabric path for Memory R/W and Kernel Launch

Both Memory R/W and Kernel Launch use the same fabric path to reach the target cube's M_CPU. The difference is what M_CPU does upon receiving the request.

Forward path (IO_CPU → target M_CPU):

IO_CPU
  → [transit cubes: ucie_out → wire → ucie_in → noc → ucie_out]  (zero or more)
  → target cube: ucie_in → noc → M_CPU

At M_CPU (diverges by operation type):

Memory R/W:     M_CPU → M_CPU.DMA → noc → hbm_ctrl
Kernel Launch:  M_CPU → PE[0..n] (parallel fan-out)

Completion path (reverse, same fabric):

Memory R/W:     hbm_ctrl → noc → M_CPU.DMA → M_CPU
Kernel Launch:  PE[0..n] all complete → M_CPU (aggregation)

M_CPU → [transit cubes: ucie → noc → ucie] → IO_CPU → runtime_api
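
For documentation or test assertions, the forward path can be expanded into a flat hop list. This is a sketch only: the function name, the "cube.component" naming, and the op strings are hypothetical, and wires are omitted. Per D3, the engine itself would not walk such a list at run time.

```python
def forward_path(transit_cubes, target_cube, op):
    """Expand the D4 forward path into a flat list of hops (wires omitted)."""
    hops = ["IO_CPU"]
    for cube in transit_cubes:  # zero or more transit cubes
        hops += [f"{cube}.ucie_in", f"{cube}.noc", f"{cube}.ucie_out"]
    hops += [f"{target_cube}.ucie_in", f"{target_cube}.noc", f"{target_cube}.M_CPU"]
    if op == "memory_rw":
        hops += [f"{target_cube}.M_CPU.DMA", f"{target_cube}.noc",
                 f"{target_cube}.hbm_ctrl"]
    elif op == "kernel_launch":
        hops.append(f"{target_cube}.PE[0..n]")  # parallel fan-out, shown as one hop
    else:
        raise ValueError(f"unknown op: {op}")
    return hops

mem_path = forward_path([], "cube0", "memory_rw")
kernel_path = forward_path(["cube1"], "cube2", "kernel_launch")
```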

D5. M_CPU.DMA is an internal subcomponent of M_CPU

M_CPU.DMA is NOT a separate topology node. It is an internal subcomponent owned by the M_CPU component implementation.

M_CPU.DMA:

  • owns the DMA READ and DMA WRITE queues (capacity=1 each, per ADR-0014 D4),
  • issues memory requests over the NOC to hbm_ctrl,
  • receives completion from hbm_ctrl via the NOC,
  • reports completion to M_CPU,
  • is created and managed inside M_CPU's __init__ and run().

M_CPU.DMA does not appear as a node in the compiled topology graph.


D6. Transit cube forwarding

A cube that is not the target of a memory or kernel request acts as a transit node. Transit cubes forward requests without consuming them:

ucie_in (from upstream) → noc → ucie_out (to downstream)

The transit decision (local delivery versus forwarding) is made entirely within the ucie_in component. The noc and ucie_out components in a transit cube forward the packet without modification.


D7. _formula_latency is preserved as a lower-bound cross-check

The path-based formula latency function (_formula_latency) is preserved in the engine as a lower bound for correctness verification.

Invariant:

  • Phase 0: _formula_latency == component model total_ns
  • Phase 1+: _formula_latency <= component model total_ns (contention adds queueing)

This function is independent of the port/wire model and requires only the topology graph. It is used for shard comparison in _route_kernel and as a regression guard.
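
The invariant can be expressed as a small regression check. This is a sketch: the helper name and the tol_ns tolerance parameter are hypothetical, and the two latencies are assumed to be computed elsewhere (by _formula_latency and by the component model).

```python
def check_latency_invariant(phase, formula_ns, simulated_ns, tol_ns=0.0):
    """Return True if the D7 invariant holds for the given phase."""
    if phase == 0:
        # Phase 0: no contention, so both models must agree.
        return abs(formula_ns - simulated_ns) <= tol_ns
    # Phase 1+: the formula is only a lower bound; queueing may add latency.
    return formula_ns <= simulated_ns + tol_ns

phase0_ok = check_latency_invariant(0, formula_ns=100, simulated_ns=100)
phase1_ok = check_latency_invariant(1, formula_ns=100, simulated_ns=120)
phase1_violation = check_latency_invariant(1, formula_ns=100, simulated_ns=90)
```

A simulated latency below the formula value in any phase indicates a modeling bug rather than contention.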


Consequences

  • Components model realistic hardware behavior (queues, contention, fan-out).
  • Propagation delay is modeled accurately per edge.
  • Engine is decoupled from routing policy.
  • Component implementations remain swappable via DI (ADR-0007 D3).
  • ADR-0007 D2 must be amended to remove path-walking from engine responsibilities.
  • ADR-0009 D3 should be updated to reference the unified fabric path (D4 above).

References

  • ADR-0007 D2 (to be amended: engine path-walking clause)
  • ADR-0009 D3 (kernel execution fan-out; fabric path to be referenced)
  • ADR-0014 D4 (DMA engine capacity=1)
  • ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)