kernbench2/docs/adr-ko/ADR-0015-dev-component-port-wire-model.md
ywkang a796c1d2f7 ADR: bilingual structure — EN canonical in adr/, KO mirror in adr-ko/
Establish English as the canonical ADR language with Korean translations
held in a parallel docs/adr-ko/ tree as derived artifacts (1:1 mirror).
Promotion from adr-proposed/ to adr/ now writes English to adr/ and the
Korean to adr-ko/; bidirectional sync rule documented in CLAUDE.md.

- Migrate 30 ADRs in docs/adr/: 28 Korean-only translated to English,
  2 bilingual pairs (ADR-0020, ADR-0023) consolidated (.en.md suffix
  dropped). ADR-0023 EN regenerated against KO source which had newer
  HW Realization Notes (D16-D23) section.
- docs/adr-history/ left frozen by design (transitional state).
- CLAUDE.md (Part 2): update ADR Lifecycle for 4-folder layout, mark
  docs/adr-ko/ as a Derived Artifact, add ADR Translation Discipline
  section covering bidirectional sync, conflict resolution (EN wins),
  and proposed-language freedom.
- tools/verify_adr_lang_pairs.py: new verification tool checking pair
  completeness, filename mirroring, ADR-ID match, Status byte-equality.
  Pre-commit hook intentionally not added; run on demand or in CI.
- tests/test_verify_adr_lang_pairs.py: 11 cases including CRLF/LF
  normalization, em-dash title separator, underscore-slug edge case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:38:44 -07:00


ADR-0015: Component Port/Wire Model and Fabric Routing

Status

Accepted

Context

Realistic hardware modeling — queues, contention, fan-out — requires that components own fabric traversal while the simulation engine handles only initialization and completion observation. Direct method calls between components, or path-walking inside the engine, defeat queueing and contention semantics.

This ADR defines:

  • how components communicate via typed port queues,
  • how propagation delay is modeled (wire processes with BW occupancy),
  • the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch (via M_CPU),
  • the engine's reduced role (wire init + completion observation only),
  • M_CPU.DMA as an internal subcomponent of M_CPU.

Decision

D1. Component port model

Each component has typed input/output ports modeled as SimPy Stores:

in_ports:  dict[str, simpy.Store]   # keyed by source node_id
out_ports: dict[str, simpy.Store]   # keyed by destination node_id

Ports are created at engine initialization based on graph edges. Each directed edge (src → dst) results in:

  • src.out_ports[dst] — the sending end
  • dst.in_ports[src] — the receiving end
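The edge-to-port mapping above can be sketched as follows. This is a minimal illustration, not the engine's code: `build_ports` and its `make_store` parameter are hypothetical names, and a plain `list` stands in for the store so the sketch runs without SimPy. In the engine the factory would presumably be something like `lambda: simpy.Store(env)`.

```python
from typing import Callable, Dict, Iterable, Tuple

Ports = Dict[str, Dict[str, object]]  # node_id -> {peer node_id -> store}


def build_ports(edges: Iterable[Tuple[str, str]],
                make_store: Callable[[], object]) -> Tuple[Ports, Ports]:
    """Create one sending and one receiving store per directed edge.

    Hypothetical helper: returns (in_ports, out_ports), each keyed first by
    the owning component's node_id and then by the peer's node_id.
    """
    in_ports: Ports = {}
    out_ports: Ports = {}
    for src, dst in edges:
        # Sending end lives on src, keyed by destination node_id.
        out_ports.setdefault(src, {})[dst] = make_store()
        # Receiving end lives on dst, keyed by source node_id.
        in_ports.setdefault(dst, {})[src] = make_store()
    return in_ports, out_ports


# Example with three directed edges; list() is a stand-in store factory.
edges = [("pcie_ep", "io_noc"), ("io_noc", "pcie_ep"), ("io_noc", "io_ucie")]
in_ports, out_ports = build_ports(edges, make_store=list)
```

The two ends of an edge are distinct stores; the wire process of D2 is what moves a command from the sending end to the receiving end.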

D2. Wire process (propagation delay + BW occupancy)

For each directed edge (src, dst) in the topology graph, a SimPy wire process models propagation delay and BW occupancy:

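def wire_process(env, out_port, in_port, delay_ns, bw_gbs):
    # Time (ns) at which this directed link is free for the next transaction.
    available_at = 0.0
    while True:
        cmd = yield out_port.get()
        # BW occupancy applies only to payload-carrying commands on a BW-limited link.
        if bw_gbs > 0:
            nbytes = getattr(cmd, "nbytes", 0)
            if nbytes > 0:
                # Back-to-back serialization: wait until the link is free.
                wait = available_at - env.now
                if wait > 0:
                    yield env.timeout(wait)
                # 1 GB/s (decimal) == 1 byte/ns, so nbytes / bw_gbs is in ns.
                available_at = env.now + (nbytes / bw_gbs)
        # Propagation delay, then delivery to the receiving port.
        yield env.timeout(delay_ns)
        yield in_port.put(cmd)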

Wire processes are started at engine initialization. Each directed edge maintains an available_at timestamp tracking when the link becomes free for the next transaction. When a transaction occupies a link, the next transaction on the same directed link must wait until occupancy clears (back-to-back serialization). TX and RX directions are independent (separate wire processes with separate available_at state).
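As a worked example of the serialization rule, the sketch below replays the same arithmetic as wire_process without SimPy. It assumes FIFO delivery from out_port, an in_port that never blocks on put, arrivals sorted by time, and bw_gbs in decimal GB/s (so 1 GB/s equals 1 byte/ns). `wire_delivery_times` is a hypothetical helper, not part of the engine.

```python
from typing import List, Sequence, Tuple


def wire_delivery_times(arrivals: Sequence[Tuple[float, int]],
                        delay_ns: float, bw_gbs: float) -> List[float]:
    """Return the time (ns) at which each command reaches in_port.

    arrivals: (time the command enters out_port, nbytes), sorted by time.
    """
    available_at = 0.0   # link-free timestamp, as in wire_process
    now = 0.0            # time at which the wire process can take a command
    delivered = []
    for t_arrive, nbytes in arrivals:
        now = max(now, t_arrive)             # out_port.get() completes
        if bw_gbs > 0 and nbytes > 0:
            now = max(now, available_at)     # wait for occupancy to clear
            available_at = now + nbytes / bw_gbs
        now += delay_ns                      # propagation delay
        delivered.append(now)
    return delivered


# Three 640-byte commands queued at t=0 on a 64 GB/s link with 2 ns delay:
# each occupies the link for 640 / 64 = 10 ns.
times = wire_delivery_times([(0.0, 640), (0.0, 640), (0.0, 640)],
                            delay_ns=2, bw_gbs=64)
print(times)  # [2.0, 12.0, 22.0]
```

The first command sees only the propagation delay; each later command is pushed back by the occupancy of the one before it.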


D3. Engine role (reduced)

The simulation engine MUST:

  • wire components at initialization (create port Stores, start wire processes),
  • identify the entry component for each request type (PCIE_EP),
  • put the request into the entry component's in_port,
  • wait for a completion event.

The simulation engine MUST NOT:

  • walk the topology path during request execution,
  • call component run() methods directly,
  • track per-hop latency or decompose fan-out.

D4. Fabric paths for Memory R/W and Kernel Launch

Memory R/W and Kernel Launch use different fabric paths. Memory operations bypass M_CPU and route directly to HBM via the crossbar. Kernel Launch routes through M_CPU for PE fan-out.

Memory R/W forward path (pcie_ep → hbm_ctrl, M_CPU bypass):

pcie_ep → io_noc → io_ucie
  → [transit cubes: ucie_in → noc → ucie_out]  (zero or more)
  → target cube: ucie_in → router mesh → hbm_ctrl

Memory R/W completion path:

hbm_ctrl → router mesh → [transit cubes: ucie → router mesh → ucie]
  → io_ucie → io_noc → pcie_ep

Kernel Launch forward path (pcie_ep → io_cpu → M_CPU → PE):

pcie_ep → io_noc → io_cpu → io_noc → io_ucie
  → [transit cubes: ucie_in → noc → ucie_out]  (zero or more)
  → target cube: ucie_in → noc → M_CPU → PE[0..n] (parallel fan-out)

Kernel Launch completion path:

PE[0..n] all complete → M_CPU (aggregation)
  → noc → [transit cubes: ucie → noc → ucie]
  → io_ucie → io_noc → io_cpu → io_noc → pcie_ep

Rationale for M_CPU bypass on Memory R/W:

Memory read/write operations do not require command interpretation or PE dispatch; they are direct data transfers to and from HBM. Routing them through M_CPU would add overhead (5 ns) with no functional benefit. The io_noc inside the IO chiplet makes the routing decision: memory operations go directly to the cube fabric, while kernel launches are forwarded to io_cpu first.
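The two forward paths can be written out as hop lists to make the difference concrete. This is only an illustrative sketch: the function names are hypothetical, the labels are copied from the path listings above, and the engine itself does not walk such lists (see D3).

```python
from typing import List, Tuple

# One transit cube contributes these three hops (D6).
TRANSIT = ["ucie_in", "noc", "ucie_out"]


def memory_rw_forward(n_transit: int = 0) -> List[str]:
    """Hop labels for a Memory R/W request; M_CPU is never visited."""
    return (["pcie_ep", "io_noc", "io_ucie"]
            + TRANSIT * n_transit
            + ["ucie_in", "router mesh", "hbm_ctrl"])


def kernel_launch_forward(n_transit: int = 0,
                          n_pe: int = 1) -> Tuple[List[str], List[str]]:
    """Hop labels up to M_CPU, plus the PEs that M_CPU fans out to."""
    trunk = (["pcie_ep", "io_noc", "io_cpu", "io_noc", "io_ucie"]
             + TRANSIT * n_transit
             + ["ucie_in", "noc", "M_CPU"])
    fan_out = [f"PE[{i}]" for i in range(n_pe)]
    return trunk, fan_out


mem_path = memory_rw_forward(n_transit=1)
trunk, pes = kernel_launch_forward(n_transit=0, n_pe=2)
```

With one transit cube, `mem_path` has nine hops and contains neither `io_cpu` nor `M_CPU`; the kernel launch trunk passes through both.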


D5. M_CPU.DMA is an internal subcomponent of M_CPU

M_CPU.DMA is NOT a separate topology node. It is an internal subcomponent owned by the M_CPU component implementation.

M_CPU.DMA:

  • owns the DMA READ and DMA WRITE queues (capacity=1 each, per ADR-0014 D4),
  • issues memory requests over the NOC to hbm_ctrl,
  • receives completion from hbm_ctrl via the NOC,
  • reports completion to M_CPU,
  • is created and managed inside M_CPU's __init__ and run().

M_CPU.DMA does not appear as a node in the compiled topology graph.
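A minimal sketch of this ownership, under stated assumptions: the class names `MCpu` and `MCpuDma` and the example node set are hypothetical, and `queue.Queue(maxsize=1)` from the standard library stands in for a capacity-1 SimPy store so the sketch runs on its own.

```python
import queue


class MCpuDma:
    """Internal DMA subcomponent; never registered as a topology node."""

    def __init__(self) -> None:
        # Capacity-1 queues per ADR-0014 D4 (stdlib stand-in for SimPy stores).
        self.read_q: queue.Queue = queue.Queue(maxsize=1)
        self.write_q: queue.Queue = queue.Queue(maxsize=1)


class MCpu:
    """Topology node that owns its DMA engine."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        # Created inside __init__, so its lifetime is tied to M_CPU.
        self.dma = MCpuDma()


# Hypothetical compiled node set: M_CPU is present, its DMA is not.
topology_nodes = {"noc", "m_cpu", "hbm_ctrl"}
m_cpu = MCpu("m_cpu")
```

Because the DMA engine is reachable only through `m_cpu.dma`, no port or wire process is created for it at engine initialization.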


D6. Transit cube forwarding

A cube that is not the target of a memory or kernel request acts as a transit node. Transit cubes forward requests without consuming them:

ucie_in (from upstream) → noc → ucie_out (to downstream)

The transit-forwarding decision is implemented entirely within the ucie_in component. The noc and ucie_out components in a transit cube carry no transit-specific logic; they forward the packet without modification.
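The decision made in ucie_in can be sketched as a comparison between the cube's own id and the request's target. The `Cmd` fields, `cube_id`, and the function name below are assumptions for illustration; the ADR does not specify the command schema.

```python
from dataclasses import dataclass


@dataclass
class Cmd:
    """Hypothetical request record; only target_cube matters here."""
    target_cube: int
    nbytes: int = 0


def ucie_in_role(cube_id: int, cmd: Cmd) -> str:
    """Classify a request arriving at a cube's ucie_in.

    "transit": forward unchanged via noc to ucie_out.
    "target":  route to the local destination inside this cube.
    """
    if cmd.target_cube != cube_id:
        return "transit"
    return "target"


# A request for cube 2 crossing cubes 0 and 1 on the way.
req = Cmd(target_cube=2, nbytes=4096)
roles = [ucie_in_role(cube, req) for cube in (0, 1, 2)]
print(roles)  # ['transit', 'transit', 'target']
```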


D7. _formula_latency is preserved as a lower-bound cross-check

The path-based formula latency function (_formula_latency) is preserved in the engine as a lower bound for correctness verification.

Invariant:

  • Phase 0: _formula_latency == component model total_ns
  • Phase 1+: _formula_latency <= component model total_ns (contention adds queueing)

This function is independent of the port/wire model and requires only the topology graph. It is used for shard comparison in _route_kernel and as a regression guard.
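The invariant can be expressed as a small check suitable for a regression test. This is a sketch under assumptions: the function name and the `tol_ns` tolerance for floating-point comparison are not from the ADR, which states the Phase 0 relation as exact equality.

```python
def formula_invariant_holds(formula_ns: float, model_ns: float,
                            phase: int, tol_ns: float = 1e-9) -> bool:
    """Check the D7 relation between _formula_latency and the model total.

    Phase 0:  formula == model (no contention, so the bound is tight).
    Phase 1+: formula <= model (queueing can only add latency).
    """
    if phase == 0:
        return abs(formula_ns - model_ns) <= tol_ns
    return formula_ns <= model_ns + tol_ns


# Phase 1 run where contention added 30 ns on top of the formula value.
ok = formula_invariant_holds(formula_ns=100.0, model_ns=130.0, phase=1)
```

A model total below the formula value fails the check in every phase, which is the case the regression guard is meant to catch.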


Consequences

  • Components model realistic hardware behavior (queues, contention, fan-out).
  • Propagation delay and BW occupancy are modeled per directed edge.
  • Engine is decoupled from routing policy.
  • Component implementations remain swappable via DI (ADR-0007 D3).

Related ADRs

  • ADR-0007 D2 (engine role boundary)
  • ADR-0009 D3 (kernel execution fan-out hierarchy)
  • ADR-0014 D4 (DMA engine capacity=1)
  • ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)
  • ADR-0016 (IOChiplet NOC and memory data path)
  • ADR-0017 (cube NOC 2D mesh architecture)
  • ADR-0033 (Latency model assumptions built on these mechanisms)