# ADR-0035: M_CPU and M_CPU.DMA Component Model

## Status

Accepted

## Context

M_CPU is the cube-level command processor. It receives commands from IO_CPU (or from PCIE_EP when the engine routes Memory R/W through M_CPU as a fallback), fans them out to the PEs in its cube, and aggregates per-PE responses into a single ResponseMsg sent back to IO_CPU on the reverse path.

M_CPU.DMA is the cube-level DMA channel pair that handles Memory R/W fan-out. Per ADR-0015 D5 it is **not** a separate topology node — it lives as internal state of `MCpuComponent`.

This ADR documents the M_CPU component implementation that realizes those responsibilities, including the three distinct fan-out paths (Memory R/W, Kernel Launch, MMU Map/Unmap), the M_CPU.DMA resource model, and the response aggregation contract.

## Decision

### D1. Role

M_CPU has three responsibilities:

1. **Transit forwarding** — when not the terminal hop (e.g., on the reverse response path PE → M_CPU → IO_CPU), forwards Transactions to `next_hop` in their pre-computed path.
2. **Multi-PE fan-out at terminal hop** — dispatches to one of three fan-out paths based on request type (D2).
3. **Response aggregation** — collects per-PE responses, sends a single aggregate ResponseMsg back to IO_CPU on the reverse path.

Per invocation (`run()`): applies `overhead_ns` once per incoming Transaction.

M_CPU does **not**:

- Decide routing — paths are pre-computed by the router (ADR-0002).
- Handle PE-internal execution — PE_CPU / PE_SCHEDULER / engines (ADR-0014).
- Decode addresses — `ctx.resolver.resolve(pa)` returns the per-PE `hbm_ctrl.pe{X}` directly (ADR-0017 D9).
- Interpret tensor or kernel semantics — fan-out dispatch by Python isinstance check only.

### D2. Three fan-out paths dispatched by request type

At the terminal hop the worker dispatches by request type:

```python
elif self.ctx is not None and txn.request is not None:
    if isinstance(txn.request, KernelLaunchMsg):
        env.process(self._kernel_launch_fanout(env, txn))
    elif isinstance(txn.request, (MmuMapMsg, MmuUnmapMsg)):
        env.process(self._mmu_msg_fanout(env, txn))
    else:
        env.process(self._dma_fanout(env, txn))
```

Each path uses a different router method:

- `_dma_fanout` uses `ctx.router.find_mcpu_dma_path()` — the M_CPU-specific DMA path that avoids PE pipeline nodes.
- `_kernel_launch_fanout` uses `ctx.router.find_node_path()` — the generic NOC command path to PE_CPU.
- `_mmu_msg_fanout` uses `ctx.router.find_node_path()` — NOC command path to PE_MMU.

### D3. M_CPU.DMA internal subcomponent (ADR-0015 D5)

`MCpuComponent.start()` initializes two SimPy resources:

```python
self._dma_write = simpy.Resource(env, capacity=1)  # MemoryWriteMsg
self._dma_read = simpy.Resource(env, capacity=1)   # MemoryReadMsg
```

Properties:

- **Not a topology node** — managed entirely inside `MCpuComponent`; does not appear in `topology.yaml` or in the compiled graph.
- **Independent read and write channels** — concurrent in-flight Memory R/W is allowed.
- **Capacity=1 per channel** serializes the **dispatch step** (`yield self.out_ports[...].put(...)`) of concurrent in-flight Memory R/W requests at this M_CPU. Actual fabric transfer time is modeled by wire processes between components (ADR-0015 D2) and by `drain_ns` at terminal hops; the DMA resource does not gate transfer duration.

Resource selection is request-type-based:

```python
dma_res = self._dma_write if isinstance(request, MemoryWriteMsg) else self._dma_read
```

### D4. Transit forwarding at non-terminal hops

When `txn.next_hop` is not None — typical for the reverse response path (PE → M_CPU → IO_CPU) — the worker forwards normally:

```python
if next_hop:
    yield self.out_ports[next_hop].put(txn.advance())
```

The fan-out branches fire only at the terminal hop. The same component therefore serves both forward command dispatch and reverse response relay roles.

### D5. DMA fan-out (`_dma_fanout` — Memory R/W)

For each Memory R/W request at terminal hop:

1. `_resolve_dma_destinations(request)` returns a per-PE `hbm_ctrl.pe{X}` derived from the request's PA via `ctx.resolver.resolve(PhysAddr.decode(pa))` (ADR-0017 D9).
2. For each destination:
   - Acquire the appropriate DMA resource (`_dma_write` or `_dma_read`) via `with dma_res.request() as req`.
   - Resolve path via `ctx.router.find_mcpu_dma_path()`.
   - Compute `drain_ns = ctx.compute_drain_ns(path, nbytes)`.
   - Create sub-Transaction carrying `drain_ns` and dispatch to `path[1]`.
3. Track `max_drain_ns` across destinations and record it as `txn.result_data["xfer_ns"]` after all responses arrive.
4. After all per-PE responses are collected (D8), send an aggregate ResponseMsg on the reverse command path back to IO_CPU.

PA decode fallback (`f"{cube_prefix}.hbm_ctrl"`) is legacy dead code — no such node exists after ADR-0017 D4's per-PE partitioning. Kept defensively but does not route to a real destination.

### D6. Kernel launch fan-out (`_kernel_launch_fanout`)

For `KernelLaunchMsg` at terminal hop:

1. `_resolve_pe_ids(target_pe)` → list of PE ids in this cube.
2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_cpu"` via `ctx.router.find_node_path()`.
3. **`target_start_ns` handling** (ADR-0009 D5):
   - If the request already carries `target_start_ns` (stamped by IO_CPU per ADR-0036 D3): **pass through unchanged**.
   - If absent (direct-to-M_CPU launch in unit tests): compute a per-cube barrier `env.now + max(per-PE leg latency)` and stamp via `dataclasses.replace`.
4. Dispatch sub-Transactions with `nbytes=0` (kernel launch is a control message; preserving nbytes=0 keeps fan-out off the shared first-hop fabric BW, mirroring ADR-0036 D4).
5. After all per-PE responses arrive (D8), aggregate per-PE metrics from each sub-Transaction's `result_data` into the parent transaction:

   ```python
   txn.result_data["pe_exec_ns"] = max(existing, max(pe_exec_values))
   txn.result_data["dma_ns"] = max(existing, max(dma_values))
   txn.result_data["compute_ns"] = max(existing, max(compute_values))
   ```

   The max-merge with the existing value matters because cross-cube IO_CPU fan-out shares the same parent `result_data`; merging prevents one cube from clobbering another's metric.
6. Send aggregate ResponseMsg on reverse path back to IO_CPU.

### D7. MMU map/unmap fan-out (`_mmu_msg_fanout`)

For `MmuMapMsg` / `MmuUnmapMsg` at terminal hop:

1. `_resolve_pe_ids(target_pe)` → PE ids.
2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_mmu"` via `find_node_path()`.
3. Dispatch sub-Transactions with `nbytes=0`.
4. PE_MMU is a terminal node — it does **not** send a ResponseMsg back. Instead, the sub-Transaction's own `sub_done` event is the completion signal.
5. Wait for all `sub_done` events in-line (does **not** use `_pending` counter — D8 is for response-bearing fan-out only).
6. Send aggregate ResponseMsg on reverse path back to IO_CPU.

### D8. Response aggregation (`_pending` + `_parent_txns`)

For DMA and kernel-launch fan-out (which expect per-PE ResponseMsg arriving on the reverse path):

```python
self._pending: dict[str, tuple[int, int, simpy.Event]] = {}
self._parent_txns: dict[str, Any] = {}
```

- On dispatch: register `(expected, received=0, all_done)` and remember the parent transaction.
- `_worker` recognises responses by `is_response=True` and routes them to `_collect_response`, which increments `received` and signals `all_done` when `received >= expected`.
- After `yield all_done`, the fan-out path constructs the aggregate ResponseMsg:

  ```python
  resp_msg = ResponseMsg(
      correlation_id=request.correlation_id,
      request_id=request.request_id,
      src_cube=cube_id,
      src_pe=-1,     # -1 = M_CPU aggregate, not a single PE
      success=True,  # no failure semantics implemented
  )
  ```

- The response Transaction travels on `list(reversed(txn.path))` back to IO_CPU.

MMU fan-out (D7) uses a simpler in-line list of `sub_done` events because PE_MMU is terminal — there is no ResponseMsg path to intercept.

### D9. Helpers and configurable attribute

`_resolve_pe_ids(target_pe)`:

- `int` → `[target_pe]`
- `tuple[int, ...]` → `list(target_pe)`
- `"all"` → `range(n_slices)` where `n_slices` comes from cube `memory_map.hbm_slices_per_cube` (default 8).

Used by kernel-launch and MMU fan-out paths.

Single configurable attribute drives per-instance latency:

| Site | impl name | overhead_ns |
| --- | --- | --- |
| Cube `m_cpu` | `builtin.m_cpu` | 5.0 |

Applied once in `run()` per Transaction — models command interpretation and dispatch-decision time at M_CPU.

## Consequences

### Positive

- Three fan-out paths are clearly separated by request type — adding a new request kind is an isinstance branch + one fan-out method.
- M_CPU.DMA channels are independent (read and write run concurrently) and serialize only the dispatch step at capacity=1.
- Transit-vs-terminal behavior is a single `if next_hop` check, so the same component handles forward dispatch and reverse response relay without role duplication.
- `target_start_ns` passthrough (D6) preserves the cross-cube barrier established by IO_CPU (ADR-0036 D3), while the fallback computation keeps direct-to-M_CPU unit tests working.
- Per-PE metric `max`-merge against existing parent `result_data` values is robust to cross-cube IO_CPU fan-out sharing the same parent.

### Negative

- No partial-failure semantics — a missing per-PE response stalls the parent `all_done` indefinitely. Acceptable for simulation; not suitable as a production-style endpoint.
- `_resolve_dma_destinations`'s cube-wide hbm_ctrl fallback is dead code (no such node exists post-ADR-0017 D4). Kept defensively; invites confusion and merits a follow-up cleanup.
- DMA resource serialization applies only at dispatch (the `put` call is instantaneous in unbounded stores). The capacity=1 channel models "one request in flight at a time at this M_CPU", not "transfer duration serialization" — readers must consult wire processes (ADR-0015 D2) and `drain_ns` for actual transfer parallelism.

## Links

- ADR-0009 D3 (M_CPU fan-out and aggregation completion semantics)
- ADR-0009 D5 (`target_start_ns` — passed through unchanged when present; computed as per-cube barrier when absent)
- ADR-0011 D-VA3 (MmuMapMsg fabric path includes M_CPU as PE fan-out point)
- ADR-0014 D4 (DMA engine capacity=1; M_CPU.DMA mirrors the same contract at cube level)
- ADR-0015 D5 (M_CPU.DMA is internal subcomponent of M_CPU, not a topology node)
- ADR-0017 D9 (AddressResolver returns per-PE `hbm_ctrl.pe{X}`)
- ADR-0036 D3 / D4 (IO_CPU stamps `target_start_ns`; M_CPU passes through unchanged; nbytes=0 invariant preserved through fan-out)
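
## Appendix: illustrative helper sketch

The selector expansion described in D9 and the max-merge described in D6 step 5 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the `MCpuComponent` code: the free-function names `resolve_pe_ids` and `max_merge`, the `TypeError` raised for an unknown selector, the 0.0 default used when a metric is absent from the parent, and returning a `list` for `"all"` are all choices made for the example.

```python
from typing import Union

# Hypothetical alias for the three selector forms listed in D9.
PeTarget = Union[int, tuple, str]


def resolve_pe_ids(target_pe: PeTarget, n_slices: int = 8) -> list:
    """Expand a target_pe selector into a concrete list of PE ids.

    n_slices stands in for memory_map.hbm_slices_per_cube (default 8 per D9).
    """
    if isinstance(target_pe, str):
        if target_pe == "all":
            return list(range(n_slices))
        raise TypeError(f"unsupported target_pe string: {target_pe!r}")
    if isinstance(target_pe, tuple):
        return list(target_pe)
    if isinstance(target_pe, int):
        return [target_pe]
    # Assumption: anything else is a caller error.
    raise TypeError(f"unsupported target_pe: {target_pe!r}")


def max_merge(result_data: dict, key: str, values: list) -> None:
    """Merge per-PE values into the parent without lowering an existing value."""
    if not values:
        return  # assumption: nothing to merge leaves the parent untouched
    existing = result_data.get(key, 0.0)  # assumption: absent metric counts as 0.0
    result_data[key] = max(existing, max(values))


# Two cubes sharing one parent result_data, as in cross-cube IO_CPU fan-out.
parent = {}
max_merge(parent, "pe_exec_ns", [120.0, 95.0])  # first cube
max_merge(parent, "pe_exec_ns", [80.0, 110.0])  # second cube keeps the larger value
```

After both merges `parent["pe_exec_ns"]` is 120.0: the second cube's smaller maximum does not overwrite the first cube's value, which is the clobbering case D6 guards against.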