Files
kernbench2/docs/adr/ADR-0015-component-port-wire-model.md
T
ywkang fc6abbc8ee Add CHANGES.md, README, update SPEC/ADRs for release 2
- CHANGES.md: detailed changelog for release 1 and 2
- README.md: full project docs with install, probe, run, test usage
- SPEC.md: add ADR-0014~0017 references, update R7 for pcie_ep endpoint
- ADR-0003: update NOC description to reference ADR-0017
- ADR-0004: add HBM efficiency factor (0.8) to BW guarantee contract
- ADR-0014: status Proposed -> Accepted
- ADR-0015: update D4 to M_CPU bypass for Memory R/W, add ADR-0016/0017 links
- ADR-0016 (new): IOChiplet NOC and memory data path
- ADR-0017 (new): Cube NOC 2D mesh architecture
- Fix MD lint warnings (unfenced code blocks) across all docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 01:43:15 -07:00

208 lines
6.7 KiB
Markdown

# ADR-0015: Component Port/Wire Model and Fabric Routing
## Status
Accepted
## Context
ADR-0007 D2 assigns path-walking and low-level request decomposition to the simulation engine.
In practice, the engine iterates the topology path and calls `run()` on each component
sequentially — conflating routing policy with component behavior and preventing realistic
hardware modeling (queues, contention, fan-out).
ADR-0007 D3 already states that components own fan-out and aggregation, but the current
implementation does not enforce this for fabric traversal.
This ADR defines:
- how components communicate via typed port queues,
- how propagation delay is modeled (wire processes with BW occupancy),
- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch (via M_CPU),
- the reduced role of the simulation engine,
- M_CPU.DMA as an internal subcomponent of M_CPU.
---
## Decision
### D1. Component port model
Each component has typed input/output ports modeled as SimPy Stores:
```text
in_ports: dict[str, simpy.Store] # keyed by source node_id
out_ports: dict[str, simpy.Store] # keyed by destination node_id
```
Ports are created at engine initialization based on graph edges.
Each directed edge (src → dst) results in:
- `src.out_ports[dst]` — the sending end
- `dst.in_ports[src]` — the receiving end
---
### D2. Wire process (propagation delay + BW occupancy)
For each directed edge (src, dst) in the topology graph, a SimPy wire process
models propagation delay and BW occupancy:
```python
def wire_process(env, out_port, in_port, delay_ns, bw_gbs):
available_at = 0.0
while True:
cmd = yield out_port.get()
if bw_gbs > 0:
nbytes = getattr(cmd, "nbytes", 0)
if nbytes > 0:
wait = available_at - env.now
if wait > 0:
yield env.timeout(wait)
available_at = env.now + (nbytes / bw_gbs)
yield env.timeout(delay_ns)
yield in_port.put(cmd)
```
Wire processes are started at engine initialization.
Each directed edge maintains an `available_at` timestamp tracking when the link
becomes free for the next transaction. When a transaction occupies a link, the
next transaction on the same directed link must wait until occupancy clears
(back-to-back serialization). TX and RX directions are independent (separate
wire processes with separate `available_at` state).
---
### D3. Engine role (reduced)
The simulation engine MUST:
- wire components at initialization (create port Stores, start wire processes),
- identify the entry component for each request type (PCIE_EP),
- put the request into the entry component's in_port,
- wait for a completion event.
The simulation engine MUST NOT:
- walk the topology path during request execution,
- call component `run()` methods directly,
- track per-hop latency or decompose fan-out.
This supersedes ADR-0007 D2's "decompose operations into low-level requests" clause.
ADR-0007 D2 must be amended accordingly.
---
### D4. Fabric paths for Memory R/W and Kernel Launch
Memory R/W and Kernel Launch use **different** fabric paths.
Memory operations bypass M_CPU and route directly to HBM via the crossbar.
Kernel Launch routes through M_CPU for PE fan-out.
**Memory R/W forward path (pcie_ep → hbm_ctrl, M_CPU bypass):**
```text
pcie_ep → io_noc → io_ucie
→ [transit cubes: ucie_in → noc → ucie_out] (zero or more)
→ target cube: ucie_in → noc → xbar → hbm_ctrl
```
**Memory R/W completion path:**
```text
hbm_ctrl → xbar → noc → [transit cubes: ucie → noc → ucie]
→ io_ucie → io_noc → pcie_ep
```
**Kernel Launch forward path (pcie_ep → io_cpu → M_CPU → PE):**
```text
pcie_ep → io_noc → io_cpu → io_noc → io_ucie
→ [transit cubes: ucie_in → noc → ucie_out] (zero or more)
→ target cube: ucie_in → noc → M_CPU → PE[0..n] (parallel fan-out)
```
**Kernel Launch completion path:**
```text
PE[0..n] all complete → M_CPU (aggregation)
→ noc → [transit cubes: ucie → noc → ucie]
→ io_ucie → io_noc → io_cpu → io_noc → pcie_ep
```
**Rationale for M_CPU bypass on Memory R/W:**
Memory write/read operations do not require command interpretation or PE
dispatch — they are direct data transfers to/from HBM. Routing through M_CPU
would add unnecessary overhead (5ns) without functional benefit. The io_noc
inside the IO chiplet handles the routing decision: memory operations go
directly to cube fabric, while kernel launches are forwarded to io_cpu first.
---
### D5. M_CPU.DMA is an internal subcomponent of M_CPU
M_CPU.DMA is NOT a separate topology node.
It is an internal subcomponent owned by the M_CPU component implementation.
M_CPU.DMA:
- owns the DMA READ and DMA WRITE queues (capacity=1 each, per ADR-0014 D4),
- issues memory requests over the NOC to hbm_ctrl,
- receives completion from hbm_ctrl via the NOC,
- reports completion to M_CPU,
- is created and managed inside M_CPU's `__init__` and `run()`.
M_CPU.DMA does not appear as a node in the compiled topology graph.
---
### D6. Transit cube forwarding
A cube that is not the target of a memory or kernel request acts as a transit node.
Transit cubes forward requests without consuming them:
```text
ucie_in (from upstream) → noc → ucie_out (to downstream)
```
Transit forwarding is implemented entirely within the ucie_in component.
The noc and ucie_out components in a transit cube forward the packet without modification.
---
### D7. _formula_latency is preserved as a lower-bound cross-check
The path-based formula latency function (`_formula_latency`) is preserved in the engine
as a lower bound for correctness verification.
Invariant:
- Phase 0: `_formula_latency == component model total_ns`
- Phase 1+: `_formula_latency <= component model total_ns` (contention adds queueing)
This function is independent of the port/wire model and requires only the topology graph.
It is used for shard comparison in `_route_kernel` and as a regression guard.
---
## Consequences
- Components model realistic hardware behavior (queues, contention, fan-out).
- Propagation delay is modeled accurately per edge.
- Engine is decoupled from routing policy.
- Component implementations remain swappable via DI (ADR-0007 D3).
- ADR-0007 D2 must be amended to remove path-walking from engine responsibilities.
- ADR-0009 D3 should be updated to reference the unified fabric path (D4 above).
---
## Links
- ADR-0007 D2 (to be amended: engine path-walking clause)
- ADR-0009 D3 (kernel execution fan-out; fabric path to be referenced)
- ADR-0014 D4 (DMA engine capacity=1)
- ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)
- ADR-0016 (IOChiplet NOC and memory data path)
- ADR-0017 (cube NOC 2D mesh architecture)