kernbench2/docs/adr/ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md

# ADR-0035: M_CPU and M_CPU.DMA Component Model

## Status

Accepted

## Context

M_CPU is the cube-level command processor. It receives commands from
IO_CPU (or from PCIE_EP when the engine routes Memory R/W through
M_CPU as a fallback), fans them out to the PEs in its cube, and
aggregates per-PE responses into a single ResponseMsg sent back to
IO_CPU on the reverse path.

M_CPU.DMA is the cube-level DMA channel pair that handles Memory R/W
fan-out. Per ADR-0015 D5 it is **not** a separate topology node —
it lives as internal state of `MCpuComponent`.

This ADR documents the M_CPU component implementation that realizes
those responsibilities, including the three distinct fan-out paths
(Memory R/W, Kernel Launch, MMU Map/Unmap), the M_CPU.DMA resource
model, and the response aggregation contract.

## Decision

### D1. Role

M_CPU has three responsibilities:

1. **Transit forwarding** — when not the terminal hop (e.g., on the
   reverse response path PE → M_CPU → IO_CPU), forwards Transactions
   to `next_hop` in their pre-computed path.
2. **Multi-PE fan-out at terminal hop** — dispatches to one of three
   fan-out paths based on request type (D2).
3. **Response aggregation** — collects per-PE responses, sends a
   single aggregate ResponseMsg back to IO_CPU on the reverse path.

Each `run()` invocation applies `overhead_ns` once per incoming
Transaction.

M_CPU does **not**:

- Decide routing — paths are pre-computed by the router (ADR-0002).
- Handle PE-internal execution — PE_CPU / PE_SCHEDULER / engines
  (ADR-0014).
- Decode addresses — `ctx.resolver.resolve(pa)` returns the per-PE
  `hbm_ctrl.pe{X}` directly (ADR-0017 D9).
- Interpret tensor or kernel semantics: fan-out dispatch relies
  solely on a Python `isinstance` check.

### D2. Three fan-out paths dispatched by request type

At the terminal hop the worker dispatches by request type:

```python
elif self.ctx is not None and txn.request is not None:
    if isinstance(txn.request, KernelLaunchMsg):
        env.process(self._kernel_launch_fanout(env, txn))
    elif isinstance(txn.request, (MmuMapMsg, MmuUnmapMsg)):
        env.process(self._mmu_msg_fanout(env, txn))
    else:
        env.process(self._dma_fanout(env, txn))
```

Each path uses a different router method:

- `_dma_fanout` uses `ctx.router.find_mcpu_dma_path()` — the
  M_CPU-specific DMA path that avoids PE pipeline nodes.
- `_kernel_launch_fanout` uses `ctx.router.find_node_path()` — the
  generic NOC command path to PE_CPU.
- `_mmu_msg_fanout` uses `ctx.router.find_node_path()` — NOC command
  path to PE_MMU.
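
As a rough illustration of the mapping above, the following sketch pairs
each request type with its fan-out method and router method. The empty
message classes are stand-ins for the real ones, and `select_fanout` is a
hypothetical helper that does not exist in `MCpuComponent`; only the
branch order is taken from the excerpt.

```python
# Stand-in message types; the real classes carry fields this sketch ignores.
class KernelLaunchMsg: ...
class MmuMapMsg: ...
class MmuUnmapMsg: ...
class MemoryWriteMsg: ...
class MemoryReadMsg: ...


def select_fanout(request) -> tuple[str, str]:
    """Return (fan-out method name, router method name) for a terminal-hop request.

    Hypothetical helper: it only mirrors the isinstance branch order of D2.
    """
    if isinstance(request, KernelLaunchMsg):
        return ("_kernel_launch_fanout", "find_node_path")
    if isinstance(request, (MmuMapMsg, MmuUnmapMsg)):
        return ("_mmu_msg_fanout", "find_node_path")
    # Anything else falls through to the Memory R/W path, like the final `else`.
    return ("_dma_fanout", "find_mcpu_dma_path")


if __name__ == "__main__":
    for msg in (KernelLaunchMsg(), MmuUnmapMsg(), MemoryReadMsg()):
        print(type(msg).__name__, "->", select_fanout(msg))
```

Because the final branch is a catch-all, any request type that is not
explicitly matched is treated as Memory R/W in this sketch.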

### D3. M_CPU.DMA internal subcomponent (ADR-0015 D5)

`MCpuComponent.start()` initializes two SimPy resources:

```python
self._dma_write = simpy.Resource(env, capacity=1)  # MemoryWriteMsg
self._dma_read  = simpy.Resource(env, capacity=1)  # MemoryReadMsg
```

Properties:

- **Not a topology node** — managed entirely inside `MCpuComponent`;
  does not appear in `topology.yaml` or in the compiled graph.
- **Independent read and write channels** — concurrent in-flight
  Memory R/W is allowed.
- **Capacity=1 per channel** serializes the **dispatch step**
  (`yield self.out_ports[...].put(...)`) of concurrent in-flight Memory
  R/W requests at this M_CPU. Actual fabric transfer time is modeled
  by wire processes between components (ADR-0015 D2) and by
  `drain_ns` at terminal hops; the DMA resource does not gate
  transfer duration.

Resource selection is request-type-based:

```python
dma_res = self._dma_write if isinstance(request, MemoryWriteMsg) else self._dma_read
```

### D4. Transit forwarding at non-terminal hops

When `txn.next_hop` is not None — typical for the reverse response
path (PE → M_CPU → IO_CPU) — the worker forwards normally:

```python
if next_hop:
    yield self.out_ports[next_hop].put(txn.advance())
```

The fan-out branches fire only at the terminal hop. The same component
therefore serves both forward command dispatch and reverse response
relay roles.
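
The transit-versus-terminal split can be sketched with a toy
`Transaction`. The shape used here (a path tuple plus a hop index) and
the node names are assumptions made for illustration; the real
Transaction class may store its position differently.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Transaction:
    """Hypothetical stand-in: a pre-computed path plus the current hop index."""
    path: tuple[str, ...]
    hop: int = 0

    @property
    def next_hop(self) -> Optional[str]:
        # None at the last node of the path, i.e. at the terminal hop.
        nxt = self.hop + 1
        return self.path[nxt] if nxt < len(self.path) else None

    def advance(self) -> "Transaction":
        # Return a copy positioned at the next node; the path is unchanged.
        return replace(self, hop=self.hop + 1)


def classify(txn: Transaction) -> str:
    """Mirror the worker's single check: forward if a next hop exists."""
    return f"forward to {txn.next_hop}" if txn.next_hop else "terminal: fan-out"


# Reverse response path, currently sitting at m_cpu (illustrative node names).
response = Transaction(path=("pe_cpu", "m_cpu", "io_cpu"), hop=1)
# Forward command path that ends at m_cpu.
command = Transaction(path=("io_cpu", "m_cpu"), hop=1)

if __name__ == "__main__":
    print(classify(response))
    print(classify(command))
```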

### D5. DMA fan-out (`_dma_fanout` — Memory R/W)

For each Memory R/W request at terminal hop:

1. `_resolve_dma_destinations(request)` returns the per-PE
   `hbm_ctrl.pe{X}` destinations derived from the request's PA via
   `ctx.resolver.resolve(PhysAddr.decode(pa))` (ADR-0017 D9).
2. For each destination:
   - Acquire the appropriate DMA resource (`_dma_write` or
     `_dma_read`) via `with dma_res.request() as req`.
   - Resolve path via `ctx.router.find_mcpu_dma_path()`.
   - Compute `drain_ns = ctx.compute_drain_ns(path, nbytes)`.
   - Create sub-Transaction carrying `drain_ns` and dispatch to
     `path[1]`.
3. Track `max_drain_ns` across destinations and record it as
   `txn.result_data["xfer_ns"]` after all responses arrive.
4. After all per-PE responses are collected (D8), send an aggregate
   ResponseMsg on the reverse command path back to IO_CPU.

The PA decode fallback (`f"{cube_prefix}.hbm_ctrl"`) is legacy dead
code: no such node exists after ADR-0017 D4's per-PE partitioning. It
is kept defensively but does not route to a real destination.
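
A minimal sketch of the `max_drain_ns` bookkeeping in step 3 follows. The
cost model (`toy_drain_ns`, bytes divided by a per-destination bandwidth)
and the path contents are assumptions for illustration only; the ADR does
not specify how `ctx.compute_drain_ns` derives its value.

```python
from typing import Callable, Sequence

Path = Sequence[str]


def record_xfer_ns(
    paths: Sequence[Path],
    nbytes: int,
    compute_drain_ns: Callable[[Path, int], float],
    result_data: dict,
) -> None:
    """Track the largest per-destination drain time and store it as xfer_ns."""
    max_drain_ns = 0.0
    for path in paths:
        drain_ns = compute_drain_ns(path, nbytes)  # carried by the sub-Transaction
        max_drain_ns = max(max_drain_ns, drain_ns)
    result_data["xfer_ns"] = max_drain_ns


# Hypothetical cost model: per-destination bandwidth in bytes per nanosecond.
BYTES_PER_NS = {"hbm_ctrl.pe0": 64.0, "hbm_ctrl.pe1": 32.0}


def toy_drain_ns(path: Path, nbytes: int) -> float:
    return nbytes / BYTES_PER_NS[path[-1]]


result_data: dict = {}
record_xfer_ns(
    [["m_cpu", "hbm_ctrl.pe0"], ["m_cpu", "hbm_ctrl.pe1"]],
    4096,
    toy_drain_ns,
    result_data,
)
```

Under this toy model the slower destination determines `xfer_ns`, which
is the behaviour the max-tracking step is meant to capture.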

### D6. Kernel launch fan-out (`_kernel_launch_fanout`)

For `KernelLaunchMsg` at terminal hop:

1. `_resolve_pe_ids(target_pe)` → list of PE ids in this cube.
2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_cpu"` via
   `ctx.router.find_node_path()`.
3. **`target_start_ns` handling** (ADR-0009 D5):
   - If the request already carries `target_start_ns` (stamped by
     IO_CPU per ADR-0036 D3): **pass through unchanged**.
   - If absent (direct-to-M_CPU launch in unit tests): compute a
     per-cube barrier `env.now + max(per-PE leg latency)` and stamp
     via `dataclasses.replace`.
4. Dispatch sub-Transactions with `nbytes=0` (kernel launch is a
   control message; preserving `nbytes=0` keeps fan-out off the shared
   first-hop fabric bandwidth, mirroring ADR-0036 D4).
5. After all per-PE responses arrive (D8), aggregate per-PE metrics
   from each sub-Transaction's `result_data` into the parent
   transaction:

   ```python
   txn.result_data["pe_exec_ns"]  = max(existing, max(pe_exec_values))
   txn.result_data["dma_ns"]      = max(existing, max(dma_values))
   txn.result_data["compute_ns"]  = max(existing, max(compute_values))
   ```

   The max-merge with the existing value matters because cross-cube
   IO_CPU fan-out shares the same parent `result_data`; merging
   prevents one cube from clobbering another's metric.
6. Send aggregate ResponseMsg on reverse path back to IO_CPU.
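
Steps 3 and 5 can be sketched in isolation as below. `KernelLaunchMsg`
is reduced to a stand-in with a single field, and `stamp_target_start`
and `merge_max` are hypothetical helper names; treating a missing parent
metric as `0.0` is also an assumption of this sketch.

```python
from dataclasses import dataclass, replace
from typing import Optional, Sequence


@dataclass(frozen=True)
class KernelLaunchMsg:
    """Stand-in carrying only the field this sketch needs."""
    target_start_ns: Optional[float] = None


def stamp_target_start(
    request: KernelLaunchMsg, now_ns: float, leg_latencies_ns: Sequence[float]
) -> KernelLaunchMsg:
    if request.target_start_ns is not None:
        return request  # already stamped upstream: pass through unchanged
    # Per-cube barrier: no PE starts before the slowest leg has delivered.
    return replace(request, target_start_ns=now_ns + max(leg_latencies_ns))


def merge_max(result_data: dict, key: str, values: Sequence[float]) -> None:
    """Max-merge per-PE values into the parent without lowering an existing value."""
    existing = result_data.get(key, 0.0)  # assumed default when no cube has reported
    result_data[key] = max(existing, max(values))


if __name__ == "__main__":
    print(stamp_target_start(KernelLaunchMsg(), 100.0, [10.0, 30.0]))
    parent = {"pe_exec_ns": 900.0}  # value already written by another cube
    merge_max(parent, "pe_exec_ns", [400.0, 650.0])
    print(parent)
```

In the usage above, the second cube's smaller `pe_exec_ns` values leave
the parent's existing maximum untouched, which is the clobbering case the
merge is meant to prevent.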

### D7. MMU map/unmap fan-out (`_mmu_msg_fanout`)

For `MmuMapMsg` / `MmuUnmapMsg` at terminal hop:

1. `_resolve_pe_ids(target_pe)` → PE ids.
2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_mmu"` via
   `find_node_path()`.
3. Dispatch sub-Transactions with `nbytes=0`.
4. PE_MMU is a terminal node — it does **not** send a ResponseMsg
   back. Instead, the sub-Transaction's own `sub_done` event is the
   completion signal.
5. Wait for all `sub_done` events in-line. This path does **not** use
   the `_pending` counter, because D8 covers response-bearing fan-out
   only.
6. Send aggregate ResponseMsg on reverse path back to IO_CPU.

### D8. Response aggregation (`_pending` + `_parent_txns`)

For DMA and kernel-launch fan-out (which expect per-PE ResponseMsg
arriving on the reverse path):

```python
self._pending: dict[str, tuple[int, int, simpy.Event]] = {}
self._parent_txns: dict[str, Any] = {}
```

- On dispatch: register `(expected, received=0, all_done)` and
  remember the parent transaction.
- `_worker` recognises responses by `is_response=True` and routes
  them to `_collect_response`, which increments `received` and
  signals `all_done` when `received >= expected`.
- After `yield all_done`, the fan-out path constructs the aggregate
  ResponseMsg:

  ```python
  resp_msg = ResponseMsg(
      correlation_id=request.correlation_id,
      request_id=request.request_id,
      src_cube=cube_id,
      src_pe=-1,             # -1 = M_CPU aggregate, not a single PE
      success=True,          # no failure semantics implemented
  )
  ```

- The response Transaction travels on `list(reversed(txn.path))`
  back to IO_CPU.

MMU fan-out (D7) uses a simpler in-line list of `sub_done` events
because PE_MMU is terminal — there is no ResponseMsg path to
intercept.
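
The counter-based aggregation can be sketched without SimPy as follows.
`DoneFlag` is a minimal stand-in for `simpy.Event`, `Aggregator` is a
hypothetical wrapper around the `_pending` dict, and keying the dict by
a request id string is an assumption; the ADR only states that the key
is a `str`.

```python
class DoneFlag:
    """Minimal stand-in for simpy.Event: records that it has been triggered."""

    def __init__(self) -> None:
        self.triggered = False

    def succeed(self) -> None:
        self.triggered = True


class Aggregator:
    def __init__(self) -> None:
        # key -> (expected, received, all_done), the tuple layout described in D8.
        self._pending: dict[str, tuple[int, int, DoneFlag]] = {}

    def register(self, key: str, expected: int) -> DoneFlag:
        """Called on dispatch: expect `expected` per-PE responses for `key`."""
        all_done = DoneFlag()
        self._pending[key] = (expected, 0, all_done)
        return all_done

    def collect_response(self, key: str) -> None:
        """Called for each per-PE response arriving on the reverse path."""
        expected, received, all_done = self._pending[key]
        received += 1
        # Tuples are immutable, so the updated count is written back.
        self._pending[key] = (expected, received, all_done)
        if received >= expected and not all_done.triggered:
            all_done.succeed()


if __name__ == "__main__":
    agg = Aggregator()
    done = agg.register("req-1", expected=3)
    for _ in range(3):
        agg.collect_response("req-1")
    print("all_done triggered:", done.triggered)
```

If fewer than `expected` responses ever arrive, the flag is never
triggered, which is the stall described under Negative consequences.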

### D9. Helpers and configurable attribute

`_resolve_pe_ids(target_pe)`:

- `int` → `[target_pe]`
- `tuple[int, ...]` → `list(target_pe)`
- `"all"` → `range(n_slices)` where `n_slices` comes from cube
  `memory_map.hbm_slices_per_cube` (default 8).

Used by kernel-launch and MMU fan-out paths.
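
The three cases above can be sketched as a standalone function. This is
an illustration, not the method's actual body: `n_slices` is passed as a
parameter instead of being read from the cube `memory_map`, the `"all"`
case is materialized as a list, and raising `ValueError` for any other
input is an assumption.

```python
from typing import Union


def resolve_pe_ids(
    target_pe: Union[int, tuple, str], n_slices: int = 8
) -> list[int]:
    """Expand a target_pe selector into a list of PE ids within one cube."""
    if isinstance(target_pe, int):
        return [target_pe]
    if isinstance(target_pe, tuple):
        return list(target_pe)
    if target_pe == "all":
        return list(range(n_slices))
    # Assumed behaviour for unsupported selectors; not specified in this ADR.
    raise ValueError(f"unsupported target_pe: {target_pe!r}")


if __name__ == "__main__":
    print(resolve_pe_ids(3))
    print(resolve_pe_ids((0, 2)))
    print(resolve_pe_ids("all", n_slices=4))
```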

A single configurable attribute drives per-instance latency:

| Site | impl name | overhead_ns |
| --- | --- | --- |
| Cube `m_cpu` | `builtin.m_cpu` | 5.0 |

It is applied once per Transaction in `run()` and models command
interpretation and dispatch-decision time at M_CPU.

## Consequences

### Positive

- Three fan-out paths are clearly separated by request type — adding
  a new request kind is an isinstance branch + one fan-out method.
- M_CPU.DMA channels are independent (read and write run concurrently)
  and serialize only the dispatch step at capacity=1.
- Transit-vs-terminal behavior is a single `if next_hop` check, so
  the same component handles forward dispatch and reverse response
  relay without role duplication.
- `target_start_ns` passthrough (D6) preserves the cross-cube barrier
  established by IO_CPU (ADR-0036 D3), while the fallback computation
  keeps direct-to-M_CPU unit tests working.
- Per-PE metric `max`-merge against existing parent `result_data`
  values is robust to cross-cube IO_CPU fan-out sharing the same
  parent.

### Negative

- No partial-failure semantics — a missing per-PE response stalls the
  parent `all_done` indefinitely. Acceptable for simulation; not
  suitable as a production-style endpoint.
- The cube-wide hbm_ctrl fallback in `_resolve_dma_destinations` is
  dead code (no such node exists post-ADR-0017 D4). It is kept
  defensively, but it invites confusion and merits a follow-up cleanup.
- DMA resource serialization applies only at dispatch (the `put` call
  is instantaneous in unbounded stores). The capacity=1 channel
  therefore models "one dispatch at a time per channel at this M_CPU",
  not "one request in flight at a time" or "transfer duration
  serialization". Readers must consult wire processes (ADR-0015 D2)
  and `drain_ns` for actual transfer parallelism.

## Links

- ADR-0009 D3 (M_CPU fan-out and aggregation completion semantics)
- ADR-0009 D5 (`target_start_ns` — passed through unchanged when
  present; computed as per-cube barrier when absent)
- ADR-0011 D-VA3 (MmuMapMsg fabric path includes M_CPU as PE fan-out
  point)
- ADR-0014 D4 (DMA engine capacity=1; M_CPU.DMA mirrors the same
  contract at cube level)
- ADR-0015 D5 (M_CPU.DMA is internal subcomponent of M_CPU, not a
  topology node)
- ADR-0017 D9 (AddressResolver returns per-PE `hbm_ctrl.pe{X}`)
- ADR-0036 D3 / D4 (IO_CPU stamps `target_start_ns`; M_CPU passes
  through unchanged; nbytes=0 invariant preserved through fan-out)