Files
kernbench2/docs/adr/ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md
T
ywkang 687c98086d ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037
Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
  (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
  Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
  deleted; ADR-0019/0021 moved to adr-history with one-line stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
  block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
  ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:15:55 -07:00

287 lines
11 KiB
Markdown

# ADR-0035: M_CPU and M_CPU.DMA Component Model
## Status
Accepted
## Context
M_CPU is the cube-level command processor. It receives commands from
IO_CPU (or from PCIE_EP when the engine routes Memory R/W through
M_CPU as a fallback), fans them out to the PEs in its cube, and
aggregates per-PE responses into a single ResponseMsg sent back to
IO_CPU on the reverse path.
M_CPU.DMA is the cube-level DMA channel pair that handles Memory R/W
fan-out. Per ADR-0015 D5 it is **not** a separate topology node —
it lives as internal state of `MCpuComponent`.
This ADR documents the M_CPU component implementation that realizes
those responsibilities, including the three distinct fan-out paths
(Memory R/W, Kernel Launch, MMU Map/Unmap), the M_CPU.DMA resource
model, and the response aggregation contract.
## Decision
### D1. Role
M_CPU has three responsibilities:
1. **Transit forwarding** — when not the terminal hop (e.g., on the
reverse response path PE → M_CPU → IO_CPU), forwards Transactions
to `next_hop` in their pre-computed path.
2. **Multi-PE fan-out at terminal hop** — dispatches to one of three
fan-out paths based on request type (D2).
3. **Response aggregation** — collects per-PE responses, sends a
single aggregate ResponseMsg back to IO_CPU on the reverse path.
Per invocation (`run()`): applies `overhead_ns` once per incoming
Transaction.
M_CPU does **not**:
- Decide routing — paths are pre-computed by the router (ADR-0002).
- Handle PE-internal execution — PE_CPU / PE_SCHEDULER / engines
(ADR-0014).
- Decode addresses — `ctx.resolver.resolve(pa)` returns the per-PE
`hbm_ctrl.pe{X}` directly (ADR-0017 D9).
- Interpret tensor or kernel semantics — fan-out dispatch by Python
isinstance check only.
### D2. Three fan-out paths dispatched by request type
At the terminal hop the worker dispatches by request type:
```python
elif self.ctx is not None and txn.request is not None:
if isinstance(txn.request, KernelLaunchMsg):
env.process(self._kernel_launch_fanout(env, txn))
elif isinstance(txn.request, (MmuMapMsg, MmuUnmapMsg)):
env.process(self._mmu_msg_fanout(env, txn))
else:
env.process(self._dma_fanout(env, txn))
```
Each path uses a different router method:
- `_dma_fanout` uses `ctx.router.find_mcpu_dma_path()` — the
M_CPU-specific DMA path that avoids PE pipeline nodes.
- `_kernel_launch_fanout` uses `ctx.router.find_node_path()` — the
generic NOC command path to PE_CPU.
- `_mmu_msg_fanout` uses `ctx.router.find_node_path()` — NOC command
path to PE_MMU.
### D3. M_CPU.DMA internal subcomponent (ADR-0015 D5)
`MCpuComponent.start()` initializes two SimPy resources:
```python
self._dma_write = simpy.Resource(env, capacity=1) # MemoryWriteMsg
self._dma_read = simpy.Resource(env, capacity=1) # MemoryReadMsg
```
Properties:
- **Not a topology node** — managed entirely inside `MCpuComponent`;
does not appear in `topology.yaml` or in the compiled graph.
- **Independent read and write channels** — concurrent in-flight
Memory R/W is allowed.
- **Capacity=1 per channel** serializes the **dispatch step**
(`yield self.out_ports[...].put(...)`) of concurrent in-flight Memory
R/W requests at this M_CPU. Actual fabric transfer time is modeled
by wire processes between components (ADR-0015 D2) and by
`drain_ns` at terminal hops; the DMA resource does not gate
transfer duration.
Resource selection is request-type-based:
```python
dma_res = self._dma_write if isinstance(request, MemoryWriteMsg) else self._dma_read
```
### D4. Transit forwarding at non-terminal hops
When `txn.next_hop` is not None — typical for the reverse response
path (PE → M_CPU → IO_CPU) — the worker forwards normally:
```python
if next_hop:
yield self.out_ports[next_hop].put(txn.advance())
```
The fan-out branches fire only at the terminal hop. The same component
therefore serves both forward command dispatch and reverse response
relay roles.
### D5. DMA fan-out (`_dma_fanout` — Memory R/W)
For each Memory R/W request at terminal hop:
1. `_resolve_dma_destinations(request)` returns a per-PE
`hbm_ctrl.pe{X}` derived from the request's PA via
`ctx.resolver.resolve(PhysAddr.decode(pa))` (ADR-0017 D9).
2. For each destination:
- Acquire the appropriate DMA resource (`_dma_write` or
`_dma_read`) via `with dma_res.request() as req`.
- Resolve path via `ctx.router.find_mcpu_dma_path()`.
- Compute `drain_ns = ctx.compute_drain_ns(path, nbytes)`.
- Create sub-Transaction carrying `drain_ns` and dispatch to
`path[1]`.
3. Track `max_drain_ns` across destinations and record it as
`txn.result_data["xfer_ns"]` after all responses arrive.
4. After all per-PE responses are collected (D8), send an aggregate
ResponseMsg on the reverse command path back to IO_CPU.
PA decode fallback (`f"{cube_prefix}.hbm_ctrl"`) is legacy dead code —
no such node exists after ADR-0017 D4's per-PE partitioning. Kept
defensively but does not route to a real destination.
### D6. Kernel launch fan-out (`_kernel_launch_fanout`)
For `KernelLaunchMsg` at terminal hop:
1. `_resolve_pe_ids(target_pe)` → list of PE ids in this cube.
2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_cpu"` via
`ctx.router.find_node_path()`.
3. **`target_start_ns` handling** (ADR-0009 D5):
- If the request already carries `target_start_ns` (stamped by
IO_CPU per ADR-0036 D3): **pass through unchanged**.
- If absent (direct-to-M_CPU launch in unit tests): compute a
per-cube barrier `env.now + max(per-PE leg latency)` and stamp
via `dataclasses.replace`.
4. Dispatch sub-Transactions with `nbytes=0` (kernel launch is a
control message; preserving nbytes=0 keeps fan-out off the shared
first-hop fabric BW, mirroring ADR-0036 D4).
5. After all per-PE responses arrive (D8), aggregate per-PE metrics
from each sub-Transaction's `result_data` into the parent
transaction:
```python
txn.result_data["pe_exec_ns"] = max(existing, max(pe_exec_values))
txn.result_data["dma_ns"] = max(existing, max(dma_values))
txn.result_data["compute_ns"] = max(existing, max(compute_values))
```
The max-merge with the existing value matters because cross-cube
IO_CPU fan-out shares the same parent `result_data`; merging
prevents one cube from clobbering another's metric.
6. Send aggregate ResponseMsg on reverse path back to IO_CPU.
### D7. MMU map/unmap fan-out (`_mmu_msg_fanout`)
For `MmuMapMsg` / `MmuUnmapMsg` at terminal hop:
1. `_resolve_pe_ids(target_pe)` → PE ids.
2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_mmu"` via
`find_node_path()`.
3. Dispatch sub-Transactions with `nbytes=0`.
4. PE_MMU is a terminal node — it does **not** send a ResponseMsg
back. Instead, the sub-Transaction's own `sub_done` event is the
completion signal.
5. Wait for all `sub_done` events in-line (does **not** use
`_pending` counter — D8 is for response-bearing fan-out only).
6. Send aggregate ResponseMsg on reverse path back to IO_CPU.
### D8. Response aggregation (`_pending` + `_parent_txns`)
For DMA and kernel-launch fan-out (which expect per-PE ResponseMsg
arriving on the reverse path):
```python
self._pending: dict[str, tuple[int, int, simpy.Event]] = {}
self._parent_txns: dict[str, Any] = {}
```
- On dispatch: register `(expected, received=0, all_done)` and
remember the parent transaction.
- `_worker` recognises responses by `is_response=True` and routes
them to `_collect_response`, which increments `received` and
signals `all_done` when `received >= expected`.
- After `yield all_done`, the fan-out path constructs the aggregate
ResponseMsg:
```python
resp_msg = ResponseMsg(
correlation_id=request.correlation_id,
request_id=request.request_id,
src_cube=cube_id,
src_pe=-1, # -1 = M_CPU aggregate, not a single PE
success=True, # no failure semantics implemented
)
```
- The response Transaction travels on `list(reversed(txn.path))`
back to IO_CPU.
MMU fan-out (D7) uses a simpler in-line list of `sub_done` events
because PE_MMU is terminal — there is no ResponseMsg path to
intercept.
### D9. Helpers and configurable attribute
`_resolve_pe_ids(target_pe)`:
- `int` → `[target_pe]`
- `tuple[int, ...]` → `list(target_pe)`
- `"all"` → `range(n_slices)` where `n_slices` comes from cube
`memory_map.hbm_slices_per_cube` (default 8).
Used by kernel-launch and MMU fan-out paths.
Single configurable attribute drives per-instance latency:
| Site | impl name | overhead_ns |
| --- | --- | --- |
| Cube `m_cpu` | `builtin.m_cpu` | 5.0 |
Applied once in `run()` per Transaction — models command
interpretation and dispatch-decision time at M_CPU.
## Consequences
### Positive
- Three fan-out paths are clearly separated by request type — adding
a new request kind is an isinstance branch + one fan-out method.
- M_CPU.DMA channels are independent (read and write run concurrently)
and serialize only the dispatch step at capacity=1.
- Transit-vs-terminal behavior is a single `if next_hop` check, so
the same component handles forward dispatch and reverse response
relay without role duplication.
- `target_start_ns` passthrough (D6) preserves the cross-cube barrier
established by IO_CPU (ADR-0036 D3), while the fallback computation
keeps direct-to-M_CPU unit tests working.
- Per-PE metric `max`-merge against existing parent `result_data`
values is robust to cross-cube IO_CPU fan-out sharing the same
parent.
### Negative
- No partial-failure semantics — a missing per-PE response stalls the
parent `all_done` indefinitely. Acceptable for simulation; not
suitable as a production-style endpoint.
- `_resolve_dma_destinations`'s cube-wide hbm_ctrl fallback is dead
code (no such node exists post-ADR-0017 D4). Kept defensively;
invites confusion and merits a follow-up cleanup.
- DMA resource serialization applies only at dispatch (the `put` call
is instantaneous in unbounded stores). The capacity=1 channel
models "one request in flight at a time at this M_CPU", not
"transfer duration serialization" — readers must consult wire
processes (ADR-0015 D2) and `drain_ns` for actual transfer
parallelism.
## Links
- ADR-0009 D3 (M_CPU fan-out and aggregation completion semantics)
- ADR-0009 D5 (`target_start_ns` — passed through unchanged when
present; computed as per-cube barrier when absent)
- ADR-0011 D-VA3 (MmuMapMsg fabric path includes M_CPU as PE fan-out
point)
- ADR-0014 D4 (DMA engine capacity=1; M_CPU.DMA mirrors the same
contract at cube level)
- ADR-0015 D5 (M_CPU.DMA is internal subcomponent of M_CPU, not a
topology node)
- ADR-0017 D9 (AddressResolver returns per-PE `hbm_ctrl.pe{X}`)
- ADR-0036 D3 / D4 (IO_CPU stamps `target_start_ns`; M_CPU passes
through unchanged; nbytes=0 invariant preserved through fan-out)