kernbench2/docs/adr/ADR-0037-dev-forwarding-component.md
ywkang 687c98086d ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037
Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
  (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
  Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
  deleted; ADR-0019/0021 moved to adr-history with one-line stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
  block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
  ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:15:55 -07:00

# ADR-0037: Forwarding Component (forwarding_v1)
## Status
Accepted
## Context
The simulation graph has many node positions that exist purely to model
fabric traversal — NOC mesh routers, switches, UCIe protocol endpoints,
IO chiplet io_noc, transit cubes. These share a common pattern: receive
a message, apply per-component overhead (modeling header decode +
routing decision time), forward to the next hop along the pre-computed
path.
This ADR defines the contract for these transit nodes: a single
component type (`TransitComponent`) that handles flit-aware forwarding
with wormhole cut-through semantics, used under multiple impl names
according to the conceptual role each instance plays.
## Decision
### D1. Role
The Forwarding component (`TransitComponent` class) is a **stateless
transit node** in the simulation graph. It models any fabric position
where a message physically traverses but no semantic processing
happens.
Per traversal, the component:
1. Reads an incoming Transaction or Flit from an `in_port`.
2. Applies the configured per-component overhead (`overhead_ns`),
applied **once per Transaction** even across multi-flit payloads
(see D2).
3. Looks up the next hop along the Transaction's pre-computed `path`.
4. Forwards to the corresponding `out_port`; at the terminal node
(no next hop), signals `txn.done` once the `is_last` flit arrives.
The component **does NOT**:
- Decide routing — paths are pre-computed by the router (ADR-0002 /
ADR-0017 D2). Forwarding only executes the per-hop step.
- Model wire propagation or bandwidth occupancy — separate wire
processes between components handle that (ADR-0015 D2).
- Resolve addresses — the AddressResolver does that (ADR-0017 D9).
- Aggregate completion — terminal endpoints (IO_CPU, M_CPU, HBM_CTRL)
handle that.
### D2. First-flit overhead model (header decode)
Per-Transaction `overhead_ns` is applied **exactly once**, at first
flit arrival:
- `_txn_decoded: set[int]` tracks which Transactions have already
paid the overhead at this node.
- On first-flit arrival for a Transaction: `yield self.run(env,
msg.txn.nbytes)` — pays the overhead.
- Subsequent flits of the same Transaction skip the overhead — they
pipeline through with no extra delay.
- On `is_last` flit: remove the Transaction from `_txn_decoded`.
This models the real-HW behavior where header decode and the routing
decision happen once, on the first flit; payload flits then stream through
the same path (wormhole cut-through). Multi-hop pipelining emerges
naturally: every hop charges its own first-flit overhead, and the
remaining flits of that Transaction pass through each hop without
paying it again.
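The bookkeeping described in this section can be sketched without SimPy. This is a minimal illustration under stated assumptions, not the project's code: the `Flit` dataclass, the `FirstFlitOverhead` class, and its `delay_for` method are hypothetical names introduced here; only `_txn_decoded`, `overhead_ns`, and `is_last` come from this ADR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flit:
    """Hypothetical flit record; only is_last is named in this ADR."""
    txn_id: int
    index: int
    is_last: bool


class FirstFlitOverhead:
    """Illustrative per-node bookkeeping for the first-flit overhead rule."""

    def __init__(self, overhead_ns: float) -> None:
        self.overhead_ns = overhead_ns
        self._txn_decoded: set[int] = set()  # Transactions that already paid here

    def delay_for(self, flit: Flit) -> float:
        """Return the overhead (in ns) this flit pays at this node."""
        delay = 0.0
        if flit.txn_id not in self._txn_decoded:
            # First flit of this Transaction at this node: pay once.
            self._txn_decoded.add(flit.txn_id)
            delay = self.overhead_ns
        if flit.is_last:
            # Last flit: forget the Transaction so the set does not grow.
            self._txn_decoded.discard(flit.txn_id)
        return delay


node = FirstFlitOverhead(overhead_ns=2.0)
flits = [Flit(txn_id=7, index=i, is_last=(i == 3)) for i in range(4)]
delays = [node.delay_for(f) for f in flits]
print(delays)  # [2.0, 0.0, 0.0, 0.0]
```

A single-flit Transaction is both first and last, so in this sketch it pays the overhead and is removed from the set in the same call.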
### D3. Serial worker forwarding (preserves order)
The component's worker is a single SimPy process that consumes flits
from `_inbox` and forwards them serially in arrival order. The
component does NOT spawn `env.process(...)` per flit.
Rationale: if the first flit yields on `overhead_ns` while subsequent
flits run in parallel processes, the later flits can overtake the
first. This produces out-of-order delivery and lets the `is_last`
flit arrive at the destination before the first flit — corrupting
both the transaction's completion semantics and any flit-index-based
processing downstream.
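The overtaking hazard can be shown with plain arithmetic. The sketch below is not the simulator's code: the function names, arrival times, and the 5.0 ns overhead are illustrative assumptions. It compares the departure order when every flit runs in its own process against a single worker that drains the inbox in arrival order.

```python
def parallel_departures(arrivals_ns, overhead_ns):
    """Each flit gets its own process; only the first flit waits on overhead."""
    out = []
    for idx, t in enumerate(arrivals_ns):
        delay = overhead_ns if idx == 0 else 0.0
        out.append((t + delay, idx))
    # Flits leave in order of departure time, not arrival order.
    return [idx for _, idx in sorted(out)]


def serial_departures(arrivals_ns, overhead_ns):
    """One worker drains the inbox in arrival order; later flits queue behind."""
    order = []
    worker_free_ns = 0.0
    for idx, t in enumerate(arrivals_ns):
        start = max(t, worker_free_ns)        # wait until the worker is free
        delay = overhead_ns if idx == 0 else 0.0
        worker_free_ns = start + delay
        order.append((worker_free_ns, idx))
    return [idx for _, idx in order]


arrivals = [0.0, 1.0, 2.0, 3.0]  # four flits of one Transaction
parallel_order = parallel_departures(arrivals, overhead_ns=5.0)
serial_order = serial_departures(arrivals, overhead_ns=5.0)
print(parallel_order)  # [1, 2, 3, 0]
print(serial_order)    # [0, 1, 2, 3]
```

In the parallel case the `is_last` flit (index 3) leaves before the first flit, which is exactly the failure the serial worker rules out.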
### D4. Path-based next-hop routing
Routing is **not** a Forwarding-component concern. The Transaction
arrives with a pre-computed `path` (built by the router; ADR-0002 /
ADR-0017 D2). The component just looks up its own position in the
path and forwards to `path[index + 1]`:
```python
def _next_hop_in_path(self, txn):
    my_id = self.node.id
    path = txn.path
    for i, n in enumerate(path):
        if n == my_id and i + 1 < len(path):
            return path[i + 1]
    return None
```
If `next_hop` is found and present in `out_ports`, the flit is
forwarded. Otherwise (terminal node), `txn.done.succeed()` is
invoked when the `is_last` flit arrives.
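As a usage illustration, the same lookup can be exercised as a free function over a list of node ids. The function name `next_hop_in_path` and the node ids in `path` are made up for this example and are not taken from the topology files.

```python
from typing import Optional, Sequence


def next_hop_in_path(my_id: str, path: Sequence[str]) -> Optional[str]:
    """Standalone variant of the lookup above, for illustration only."""
    for i, node_id in enumerate(path):
        if node_id == my_id and i + 1 < len(path):
            return path[i + 1]
    return None


# Hypothetical path; real paths are built by the router.
path = ["io_noc", "ucie-N", "noc_router_0", "hbm_ctrl"]
mid_hop = next_hop_in_path("ucie-N", path)
terminal_hop = next_hop_in_path("hbm_ctrl", path)
off_path_hop = next_hop_in_path("switch0", path)
print(mid_hop, terminal_hop, off_path_hop)  # noc_router_0 None None
```

Note that a node id absent from the path also yields `None` in this sketch, so it is indistinguishable from the terminal case at this level.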
### D5. Flit-aware mode with Non-Flit fallback
`_FLIT_AWARE = True` opts this component out of the base class's
flit-reassembly logic in `_fan_in`. Flits are placed directly on
`_inbox` (no reassembly), enabling per-flit handling in the worker
loop (D2, D3).
Non-Flit messages — zero-byte control Transactions and other
non-chunkified payloads — fall through to the base class's legacy
`_forward_txn` path via `env.process`. This preserves backward
compatibility for control-plane traffic that does not benefit from
flit-level processing.
### D6. Multi-stream merging at the base class
Multi-stream FIFO merging at routers is the base class's
responsibility, not Forwarding's. The base class's `_fan_in` spawns
one process per `in_port`; all push to a single shared `_inbox`.
Flits from different upstream streams therefore interleave at
flit granularity in `_inbox`'s FIFO order.
The Forwarding worker simply consumes `_inbox` in arrival order —
correctly modeling per-router multi-flow arbitration as
fair-FIFO over the shared inbox.
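The interleaving can be pictured with a small stand-alone sketch. It assumes two upstream streams whose flits carry distinct arrival timestamps; the tuples, stream labels, and use of `heapq.merge` are illustrative choices and do not reflect how `_fan_in` is implemented.

```python
import heapq
from collections import deque

# (arrival_ns, stream, flit_index) per in_port, already in arrival order.
stream_a = [(0.0, "A", 0), (2.0, "A", 1), (4.0, "A", 2)]
stream_b = [(1.0, "B", 0), (3.0, "B", 1)]

# Both fan-in processes push into one shared inbox; arrival time decides order.
inbox = deque(heapq.merge(stream_a, stream_b))

served = []
while inbox:
    _, stream, idx = inbox.popleft()  # the worker consumes in FIFO order
    served.append(f"{stream}{idx}")

print(served)  # ['A0', 'B0', 'A1', 'B1', 'A2']
```

Each stream keeps its own internal order while the two streams interleave at flit granularity.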
### D7. Single implementation under multiple impl names
A single `TransitComponent` class is registered under four impl names
in `components.yaml`:
- `builtin.forwarding` — generic forwarding (e.g., `io_noc`,
`noc_router`, UCIe conn bridges)
- `builtin.switch` — tray-level switch
- `builtin.noc` — cube-level NOC fabric (legacy singleton; current
NOC routers use `builtin.forwarding`)
- `builtin.ucie` — UCIe protocol endpoint
All four aliases instantiate the same class with the same behavior.
Per-instance differentiation lives only in `attrs.overhead_ns`.
Separate impl names exist as intent tags for readability and to
allow future divergence without backward-incompatible config
changes.
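The aliasing can be pictured as a registry in which several keys point to one class. This is only a sketch: the `IMPL_REGISTRY` dict and `build` helper are hypothetical, the `TransitComponent` here is a stand-in that models nothing but `overhead_ns`, and the actual registration lives in `components.yaml`, whose schema is not reproduced in this ADR.

```python
class TransitComponent:
    """Stand-in for the real class; only overhead_ns is modelled here."""

    def __init__(self, overhead_ns: float = 0.0) -> None:
        self.overhead_ns = overhead_ns


# Hypothetical registry: four impl names, one class.
IMPL_REGISTRY = {
    "builtin.forwarding": TransitComponent,
    "builtin.switch": TransitComponent,
    "builtin.noc": TransitComponent,
    "builtin.ucie": TransitComponent,
}


def build(impl: str, attrs: dict) -> TransitComponent:
    """Instantiate by impl name; attrs.overhead_ns is the only differentiator."""
    return IMPL_REGISTRY[impl](overhead_ns=attrs.get("overhead_ns", 0.0))


switch = build("builtin.switch", {"overhead_ns": 5.0})
bridge = build("builtin.forwarding", {})
print(type(switch) is type(bridge), switch.overhead_ns, bridge.overhead_ns)
# True 5.0 0.0
```

If one role later needs different behavior, only its registry entry would change, which is the forward-compatibility argument made above.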
### D8. Configurable `overhead_ns`
A single attribute drives per-instance latency:
| Usage site | impl name | overhead_ns |
| --- | --- | --- |
| Tray-level switch | `builtin.switch` | 5.0 |
| Cube NOC router | `builtin.forwarding` | 2.0 |
| IO chiplet io_noc | `builtin.forwarding` | 0.0 |
| UCIe protocol endpoint (`ucie-{N,S,E,W}`) | `builtin.ucie` | 8.0 |
| UCIe conn bridge (`ucie-{PORT}.conn{N}`) | `builtin.forwarding` | 0.0 |
Default is 0.0. The attribute is read at each `run()` invocation, so
dynamic reconfiguration is possible but not currently used.
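As a worked example, the first-flit forwarding overhead along a path is the sum of the `overhead_ns` values of the transit nodes it crosses. The values below are copied from the table; the hop sequence is a hypothetical path chosen for illustration, and wire propagation, bandwidth occupancy, and endpoint latency are deliberately left out because other components model them.

```python
# overhead_ns per usage site, taken from the table above.
OVERHEAD_NS = {
    "tray switch": 5.0,
    "cube NOC router": 2.0,
    "io_noc": 0.0,
    "UCIe endpoint": 8.0,
    "UCIe conn bridge": 0.0,
}

# Hypothetical hop sequence; real paths come from the router.
hops = [
    "io_noc",
    "UCIe endpoint",
    "UCIe conn bridge",
    "cube NOC router",
    "cube NOC router",
]

total_first_flit_overhead_ns = sum(OVERHEAD_NS[h] for h in hops)
print(total_first_flit_overhead_ns)  # 12.0
```

Only the first flit of a Transaction accumulates this total; later flits stream behind it without paying it again.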
## Consequences
### Positive
- A single class handles all transit-node roles in the simulation
graph — minimal code surface for a high-population component type.
- Flit-aware processing + serial worker preserves wormhole semantics
across multi-hop paths without per-flit process overhead.
- `overhead_ns` is the only per-instance tunable; routing, BW, and
address resolution stay cleanly separated in their own components /
modules.
- Multi-stream merging emerges from the base-class structure; no
router-specific logic duplicates fair-FIFO arbitration.
- Non-Flit fallback path keeps control-plane traffic working without
forcing every message into the flit framework.
### Negative
- The single class hides usage-site intent inside `attrs.overhead_ns`
configuration; readers must consult `topology.yaml` and
`components.yaml` to see which impl name and overhead value a given
node actually uses.
- The per-flit serial worker becomes a bottleneck if `overhead_ns` is large
and many concurrent transactions arrive at the same router; current
values (0 to 8 ns) make this negligible.
## Links
- ADR-0002 (Routing distance — path computation)
- ADR-0015 D1 (Component port model)
- ADR-0015 D2 (Wire process — BW + propagation, separate from this
component)
- ADR-0015 D6 (Transit cube forwarding pattern)
- ADR-0016 D1 (IO chiplet io_noc — uses this component)
- ADR-0017 D1 (Cube NOC routers — use this component)
- ADR-0017 D6 (UCIe decomposition — `ucie-{PORT}` instances use this
component)
- ADR-0033 D1 (Flit-aware pass-through, first-flit overhead,
multi-stream merge semantics)