# ADR-0015: Component Port/Wire Model and Fabric Routing

## Status

Accepted

## Context

Realistic hardware modeling (queues, contention, fan-out) requires
that components own fabric traversal while the simulation engine
handles only initialization and completion observation. Direct method
calls between components, or path-walking inside the engine, defeat
queueing and contention semantics.

This ADR defines:

- how components communicate via typed port queues,
- how propagation delay is modeled (wire processes with BW occupancy),
- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch
  (via M_CPU),
- the engine's reduced role (wire init + completion observation only),
- M_CPU.DMA as an internal subcomponent of M_CPU.

---

## Decision

### D1. Component port model

Each component has typed input/output ports modeled as SimPy Stores:

```text
in_ports: dict[str, simpy.Store]   # keyed by source node_id
out_ports: dict[str, simpy.Store]  # keyed by destination node_id
```

Ports are created at engine initialization based on graph edges.
Each directed edge (src → dst) results in:

- `src.out_ports[dst]`: the sending end
- `dst.in_ports[src]`: the receiving end

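
The edge-to-port rule above can be sketched in a few lines. This is a minimal illustration, not the engine's actual wiring code: `build_ports` is a hypothetical helper, the edge list is made up for the example, and `collections.deque` stands in for `simpy.Store` so the snippet runs without SimPy.

```python
from collections import deque


def build_ports(edges):
    """Create one out-port and one in-port per directed edge (src, dst).

    Hypothetical helper; deque stands in for simpy.Store.
    Returns (in_ports, out_ports), each mapping node_id -> {peer_id: queue}.
    """
    in_ports, out_ports = {}, {}
    for src, dst in edges:
        out_ports.setdefault(src, {})[dst] = deque()  # src.out_ports[dst]: sending end
        in_ports.setdefault(dst, {})[src] = deque()   # dst.in_ports[src]: receiving end
    return in_ports, out_ports


# Illustrative edge list; a bidirectional link is two directed edges.
edges = [("pcie_ep", "io_noc"), ("io_noc", "pcie_ep"), ("io_noc", "io_ucie")]
in_ports, out_ports = build_ports(edges)

print(sorted(out_ports["io_noc"]))  # peers io_noc can send to
print(sorted(in_ports["io_noc"]))   # peers io_noc can receive from
```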

---

### D2. Wire process (propagation delay + BW occupancy)

For each directed edge (src, dst) in the topology graph, a SimPy wire process
models propagation delay and BW occupancy:

```python
def wire_process(env, out_port, in_port, delay_ns, bw_gbs):
    available_at = 0.0  # time (ns) at which this directed link is next free
    while True:
        cmd = yield out_port.get()
        if bw_gbs > 0:
            nbytes = getattr(cmd, "nbytes", 0)
            if nbytes > 0:
                # Wait until the previous transaction's occupancy clears.
                wait = available_at - env.now
                if wait > 0:
                    yield env.timeout(wait)
                # 1 GB/s == 1 byte/ns, so nbytes / bw_gbs is in ns.
                available_at = env.now + (nbytes / bw_gbs)
        yield env.timeout(delay_ns)  # propagation delay
        yield in_port.put(cmd)
```

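
The timing rule in `wire_process` can be replayed by hand without SimPy, which makes the back-to-back serialization easy to check. The sketch below is a simplified model, not project code: `delivery_times` is a hypothetical helper, and it assumes the input port is unbounded so that `in_port.put` never blocks. In the example each 256-byte command occupies the 64 GB/s link for 256 / 64 = 4 ns, so the second command is delivered 4 ns after the first.

```python
def delivery_times(commands, delay_ns, bw_gbs):
    """Replay wire_process timing for commands given as (put_time_ns, nbytes).

    Hypothetical helper. Commands must be listed in the order they were put.
    Assumes in_port is unbounded, so in_port.put() returns immediately.
    1 GB/s equals 1 byte/ns, so nbytes / bw_gbs is an occupancy in ns.
    """
    available_at = 0.0  # when the link is free for the next transaction
    idle_at = 0.0       # when the wire process loops back to out_port.get()
    delivered = []
    for put_time, nbytes in commands:
        now = max(put_time, idle_at)      # get() returns once a command is queued
        if bw_gbs > 0 and nbytes > 0:
            now = max(now, available_at)  # wait for the previous occupancy to clear
            available_at = now + nbytes / bw_gbs
        now += delay_ns                   # propagation delay
        delivered.append(now)
        idle_at = now                     # one command in flight per wire process
    return delivered


# Two 256-byte commands queued at t=0 on a 64 GB/s link with 2 ns delay.
times = delivery_times([(0.0, 256), (0.0, 256)], delay_ns=2.0, bw_gbs=64.0)
print(times)  # -> [2.0, 6.0]
```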

Wire processes are started at engine initialization.
Each directed edge maintains an `available_at` timestamp tracking when the link
becomes free for the next transaction. When a transaction occupies a link, the
next transaction on the same directed link must wait until occupancy clears
(back-to-back serialization). TX and RX directions are independent (separate
wire processes with separate `available_at` state).

---

### D3. Engine role (reduced)

The simulation engine MUST:

- wire components at initialization (create port Stores, start wire processes),
- identify the entry component for each request type (PCIE_EP),
- put the request into the entry component's in_port,
- wait for a completion event.

The simulation engine MUST NOT:

- walk the topology path during request execution,
- call component `run()` methods directly,
- track per-hop latency or decompose fan-out.


---

### D4. Fabric paths for Memory R/W and Kernel Launch

Memory R/W and Kernel Launch use **different** fabric paths.
Memory operations bypass M_CPU and route directly to HBM via the crossbar.
Kernel Launch routes through M_CPU for PE fan-out.

**Memory R/W forward path (pcie_ep → hbm_ctrl, M_CPU bypass):**

```text
pcie_ep → io_noc → io_ucie
  → [transit cubes: ucie_in → noc → ucie_out]  (zero or more)
  → target cube: ucie_in → router mesh → hbm_ctrl
```

**Memory R/W completion path:**

```text
hbm_ctrl → router mesh → [transit cubes: ucie → router mesh → ucie]
  → io_ucie → io_noc → pcie_ep
```

**Kernel Launch forward path (pcie_ep → io_cpu → M_CPU → PE):**

```text
pcie_ep → io_noc → io_cpu → io_noc → io_ucie
  → [transit cubes: ucie_in → noc → ucie_out]  (zero or more)
  → target cube: ucie_in → noc → M_CPU → PE[0..n]  (parallel fan-out)
```

**Kernel Launch completion path:**

```text
PE[0..n] all complete → M_CPU (aggregation)
  → noc → [transit cubes: ucie → noc → ucie]
  → io_ucie → io_noc → io_cpu → io_noc → pcie_ep
```

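
The two forward paths in D4 differ only in the IO-chiplet prefix and the target-cube suffix, which the sketch below makes explicit. It is illustrative only: `forward_path` and the `cube.component` hop names are hypothetical and not taken from the codebase. In the component model each hop is taken by the components themselves, so a list like this is useful for documentation or cross-checks rather than for execution.

```python
def forward_path(kind, transit_cubes, target_cube):
    """List the hops of a D4 forward path (hypothetical helper and hop names).

    kind is "memory" (M_CPU bypass) or "kernel" (via io_cpu and M_CPU).
    """
    if kind == "memory":
        path = ["pcie_ep", "io_noc", "io_ucie"]
        tail = ["ucie_in", "router_mesh", "hbm_ctrl"]
    elif kind == "kernel":
        path = ["pcie_ep", "io_noc", "io_cpu", "io_noc", "io_ucie"]
        tail = ["ucie_in", "noc", "m_cpu"]  # M_CPU then fans out to PE[0..n]
    else:
        raise ValueError(f"unknown request kind: {kind!r}")
    for cube in transit_cubes:  # zero or more transit cubes
        path += [f"{cube}.ucie_in", f"{cube}.noc", f"{cube}.ucie_out"]
    path += [f"{target_cube}.{name}" for name in tail]
    return path


mem = forward_path("memory", ["cube0"], "cube1")
krn = forward_path("kernel", [], "cube0")
print(" -> ".join(mem))
print(" -> ".join(krn))
```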
**Rationale for M_CPU bypass on Memory R/W:**

Memory write/read operations do not require command interpretation or PE
dispatch; they are direct data transfers to/from HBM. Routing through M_CPU
would add unnecessary overhead (5 ns) without functional benefit. The io_noc
inside the IO chiplet handles the routing decision: memory operations go
directly to cube fabric, while kernel launches are forwarded to io_cpu first.

---

### D5. M_CPU.DMA is an internal subcomponent of M_CPU

M_CPU.DMA is NOT a separate topology node.
It is an internal subcomponent owned by the M_CPU component implementation.

M_CPU.DMA:

- owns the DMA READ and DMA WRITE queues (capacity=1 each, per ADR-0014 D4),
- issues memory requests over the NOC to hbm_ctrl,
- receives completion from hbm_ctrl via the NOC,
- reports completion to M_CPU,
- is created and managed inside M_CPU's `__init__` and `run()`.

M_CPU.DMA does not appear as a node in the compiled topology graph.

---

### D6. Transit cube forwarding

A cube that is not the target of a memory or kernel request acts as a transit node.
Transit cubes forward requests without consuming them:

```text
ucie_in (from upstream) → noc → ucie_out (to downstream)
```

Transit forwarding is implemented entirely within the ucie_in component.
The noc and ucie_out components in a transit cube forward the packet without modification.

---

### D7. _formula_latency is preserved as a lower-bound cross-check

The path-based formula latency function (`_formula_latency`) is preserved in the engine
as a lower bound for correctness verification.

Invariant:

- Phase 0: `_formula_latency == component model total_ns`
- Phase 1+: `_formula_latency <= component model total_ns` (contention adds queueing)

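
The invariant can be written as a small regression check. The sketch is an assumption about how such a guard might look: `check_latency_invariant`, its arguments, and the floating-point tolerance are hypothetical, not the project's test code.

```python
import math


def check_latency_invariant(formula_ns, total_ns, phase, rel_tol=1e-9):
    """Return True if the D7 invariant holds (hypothetical helper).

    formula_ns: value of the path-based formula latency.
    total_ns:   total_ns reported by the component model.
    phase:      0 requires equality; 1 and later allow extra queueing.
    """
    if phase == 0:
        return math.isclose(formula_ns, total_ns, rel_tol=rel_tol)
    # Phase 1+: contention can only add queueing delay on top of the formula.
    return formula_ns <= total_ns or math.isclose(formula_ns, total_ns, rel_tol=rel_tol)


print(check_latency_invariant(120.0, 120.0, phase=0))  # equal in Phase 0
print(check_latency_invariant(120.0, 135.5, phase=1))  # queueing added
print(check_latency_invariant(120.0, 110.0, phase=1))  # model below the bound
```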

`_formula_latency` is independent of the port/wire model and requires only the topology graph.
It is used for shard comparison in `_route_kernel` and as a regression guard.

---

## Consequences

- Components model realistic hardware behavior (queues, contention, fan-out).
- Propagation delay is modeled accurately per edge.
- Engine is decoupled from routing policy.
- Component implementations remain swappable via DI (ADR-0007 D3).

---

## Links

- ADR-0007 D2 (engine role boundary)
- ADR-0009 D3 (kernel execution fan-out hierarchy)
- ADR-0014 D4 (DMA engine capacity=1)
- ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)
- ADR-0016 (IOChiplet NOC and memory data path)
- ADR-0017 (cube NOC 2D mesh architecture)
- ADR-0033 (Latency model assumptions built on these mechanisms)