kernbench2/docs/adr/ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md
ywkang 687c98086d ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037
Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
  (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
  Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
  deleted; ADR-0019/0021 moved to adr-history with one-line stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
  block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
  ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:15:55 -07:00

11 KiB

ADR-0035: M_CPU and M_CPU.DMA Component Model

Status

Accepted

Context

M_CPU is the cube-level command processor. It receives commands from IO_CPU (or from PCIE_EP when the engine routes Memory R/W through M_CPU as a fallback), fans them out to the PEs in its cube, and aggregates per-PE responses into a single ResponseMsg sent back to IO_CPU on the reverse path.

M_CPU.DMA is the cube-level DMA channel pair that handles Memory R/W fan-out. Per ADR-0015 D5 it is not a separate topology node — it lives as internal state of MCpuComponent.

This ADR documents the M_CPU component implementation that realizes those responsibilities, including the three distinct fan-out paths (Memory R/W, Kernel Launch, MMU Map/Unmap), the M_CPU.DMA resource model, and the response aggregation contract.

Decision

D1. Role

M_CPU has three responsibilities:

  1. Transit forwarding — when not the terminal hop (e.g., on the reverse response path PE → M_CPU → IO_CPU), forwards Transactions to next_hop in their pre-computed path.
  2. Multi-PE fan-out at terminal hop — dispatches to one of three fan-out paths based on request type (D2).
  3. Response aggregation — collects per-PE responses, sends a single aggregate ResponseMsg back to IO_CPU on the reverse path.

Each run() invocation applies overhead_ns once per incoming Transaction.

M_CPU does not:

  • Decide routing — paths are pre-computed by the router (ADR-0002).
  • Handle PE-internal execution — PE_CPU / PE_SCHEDULER / engines (ADR-0014).
  • Decode addresses — ctx.resolver.resolve(pa) returns the per-PE hbm_ctrl.pe{X} directly (ADR-0017 D9).
  • Interpret tensor or kernel semantics — fan-out dispatch by Python isinstance check only.

D2. Three fan-out paths dispatched by request type

At the terminal hop the worker dispatches by request type:

# Excerpt from the worker; this branch is reached only at the terminal hop (see D4).
elif self.ctx is not None and txn.request is not None:
    if isinstance(txn.request, KernelLaunchMsg):
        env.process(self._kernel_launch_fanout(env, txn))
    elif isinstance(txn.request, (MmuMapMsg, MmuUnmapMsg)):
        env.process(self._mmu_msg_fanout(env, txn))
    else:
        env.process(self._dma_fanout(env, txn))

Each path uses a different router method:

  • _dma_fanout uses ctx.router.find_mcpu_dma_path() — the M_CPU-specific DMA path that avoids PE pipeline nodes.
  • _kernel_launch_fanout uses ctx.router.find_node_path() — the generic NOC command path to PE_CPU.
  • _mmu_msg_fanout uses ctx.router.find_node_path() — NOC command path to PE_MMU.
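
The dispatch rule above can be sketched as a standalone function. This is a minimal illustration, not the MCpuComponent code: the message classes are empty stand-ins for the real kernbench types, and select_fanout is a hypothetical helper that returns the fan-out method name instead of starting a SimPy process.

```python
from dataclasses import dataclass


# Empty stand-ins for the real message types (assumed shapes, illustration only).
@dataclass(frozen=True)
class KernelLaunchMsg:
    pass


@dataclass(frozen=True)
class MmuMapMsg:
    pass


@dataclass(frozen=True)
class MmuUnmapMsg:
    pass


@dataclass(frozen=True)
class MemoryReadMsg:
    pass


def select_fanout(request: object) -> str:
    """Return the name of the fan-out path chosen for a terminal-hop request."""
    if isinstance(request, KernelLaunchMsg):
        return "_kernel_launch_fanout"
    if isinstance(request, (MmuMapMsg, MmuUnmapMsg)):
        return "_mmu_msg_fanout"
    # Anything else falls through to the Memory R/W path, like the final else branch.
    return "_dma_fanout"
```

Note that the Memory R/W path is the default branch, so any request type not matched by the first two checks is treated as a DMA request.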

D3. M_CPU.DMA internal subcomponent (ADR-0015 D5)

MCpuComponent.start() initializes two SimPy resources:

self._dma_write = simpy.Resource(env, capacity=1)  # MemoryWriteMsg
self._dma_read  = simpy.Resource(env, capacity=1)  # MemoryReadMsg

Properties:

  • Not a topology node — managed entirely inside MCpuComponent; does not appear in topology.yaml or in the compiled graph.
  • Independent read and write channels — concurrent in-flight Memory R/W is allowed.
  • Capacity=1 per channel serializes the dispatch step (yield self.out_ports[...].put(...)) of concurrent in-flight Memory R/W requests at this M_CPU. Actual fabric transfer time is modeled by wire processes between components (ADR-0015 D2) and by drain_ns at terminal hops; the DMA resource does not gate transfer duration.

Resource selection is request-type-based:

dma_res = self._dma_write if isinstance(request, MemoryWriteMsg) else self._dma_read
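
The independence of the two channels can be illustrated without SimPy. In the sketch below, threading.Lock is swapped in as a stand-in for a capacity=1 simpy.Resource, and DmaChannels is a hypothetical container, not a kernbench class. It only shows which requests contend for the same channel; it does not model simulated time.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryWriteMsg:  # stand-in for the real message type
    pass


@dataclass(frozen=True)
class MemoryReadMsg:  # stand-in for the real message type
    pass


class DmaChannels:
    """Two independent capacity-1 channels (Lock stands in for simpy.Resource)."""

    def __init__(self) -> None:
        self.write = threading.Lock()
        self.read = threading.Lock()

    def select(self, request: object):
        # Same rule as the excerpt: writes use the write channel, all else the read channel.
        return self.write if isinstance(request, MemoryWriteMsg) else self.read


channels = DmaChannels()
write_ch = channels.select(MemoryWriteMsg())
read_ch = channels.select(MemoryReadMsg())

with write_ch:  # one write dispatch currently holds the write channel
    # A second write cannot start its dispatch: the channel has capacity 1.
    second_write_can_start = write_ch.acquire(blocking=False)
    # A read is unaffected because it uses the other channel.
    read_can_start = read_ch.acquire(blocking=False)
    if read_can_start:
        read_ch.release()
```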

D4. Transit forwarding at non-terminal hops

When txn.next_hop is not None — typical for the reverse response path (PE → M_CPU → IO_CPU) — the worker forwards normally:

if next_hop:
    yield self.out_ports[next_hop].put(txn.advance())

The fan-out branches fire only at the terminal hop. The same component therefore serves both forward command dispatch and reverse response relay roles.
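
A small sketch of the transit-versus-terminal distinction follows. Txn is a hypothetical stand-in for Transaction that assumes a fixed path plus a hop index; the node names are illustrative, and the real class may track its position differently.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Txn:
    """Hypothetical stand-in for Transaction: a fixed path and the current hop index."""
    path: tuple[str, ...]
    hop: int = 0

    @property
    def next_hop(self) -> Optional[str]:
        # None means the current node is the last node of the pre-computed path.
        nxt = self.hop + 1
        return self.path[nxt] if nxt < len(self.path) else None

    def advance(self) -> "Txn":
        # Return a copy positioned at the next node on the path.
        return replace(self, hop=self.hop + 1)


def role_at(txn: Txn) -> str:
    """Classify what M_CPU does with a transaction at its current hop."""
    return "transit" if txn.next_hop else "terminal"


# Forward command: M_CPU is the last node on the path, so it fans out.
forward = Txn(path=("io_cpu", "m_cpu")).advance()
# Reverse response: M_CPU sits in the middle of the path, so it only relays.
reverse = Txn(path=("pe_cpu", "m_cpu", "io_cpu")).advance()
```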

D5. DMA fan-out (_dma_fanout — Memory R/W)

For each Memory R/W request at terminal hop:

  1. _resolve_dma_destinations(request) returns a per-PE hbm_ctrl.pe{X} derived from the request's PA via ctx.resolver.resolve(PhysAddr.decode(pa)) (ADR-0017 D9).
  2. For each destination:
    • Acquire the appropriate DMA resource (_dma_write or _dma_read) via with dma_res.request() as req.
    • Resolve path via ctx.router.find_mcpu_dma_path().
    • Compute drain_ns = ctx.compute_drain_ns(path, nbytes).
    • Create sub-Transaction carrying drain_ns and dispatch to path[1].
  3. Track max_drain_ns across destinations and record it as txn.result_data["xfer_ns"] after all responses arrive.
  4. After all per-PE responses are collected (D8), send an aggregate ResponseMsg on the reverse command path back to IO_CPU.

PA decode fallback (f"{cube_prefix}.hbm_ctrl") is legacy dead code — no such node exists after ADR-0017 D4's per-PE partitioning. Kept defensively but does not route to a real destination.
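
Step 3 records the maximum drain time rather than a sum. A minimal sketch of that bookkeeping, with made-up drain values and a hypothetical helper name (the real values come from ctx.compute_drain_ns, whose formula is not shown here):

```python
def aggregate_xfer_ns(drain_ns_by_dest: dict[str, float]) -> float:
    """Return the value recorded as xfer_ns: the largest drain_ns over all destinations."""
    max_drain_ns = 0.0
    for drain_ns in drain_ns_by_dest.values():
        max_drain_ns = max(max_drain_ns, drain_ns)
    return max_drain_ns


# Illustrative per-destination drain times in nanoseconds (not measured data).
drains = {"hbm_ctrl.pe0": 120.0, "hbm_ctrl.pe3": 340.0, "hbm_ctrl.pe5": 95.5}
result_data = {"xfer_ns": aggregate_xfer_ns(drains)}
```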

D6. Kernel launch fan-out (_kernel_launch_fanout)

For KernelLaunchMsg at terminal hop:

  1. _resolve_pe_ids(target_pe) → list of PE ids in this cube.

  2. For each PE: find path to f"{cube_prefix}.pe{pe_id}.pe_cpu" via ctx.router.find_node_path().

  3. target_start_ns handling (ADR-0009 D5):

    • If the request already carries target_start_ns (stamped by IO_CPU per ADR-0036 D3): pass through unchanged.
    • If absent (direct-to-M_CPU launch in unit tests): compute a per-cube barrier env.now + max(per-PE leg latency) and stamp via dataclasses.replace.
  4. Dispatch sub-Transactions with nbytes=0 (kernel launch is a control message; preserving nbytes=0 keeps fan-out off the shared first-hop fabric BW, mirroring ADR-0036 D4).

  5. After all per-PE responses arrive (D8), aggregate per-PE metrics from each sub-Transaction's result_data into the parent transaction:

    txn.result_data["pe_exec_ns"]  = max(existing, max(pe_exec_values))
    txn.result_data["dma_ns"]      = max(existing, max(dma_values))
    txn.result_data["compute_ns"]  = max(existing, max(compute_values))
    

    The max-merge with the existing value matters because cross-cube IO_CPU fan-out shares the same parent result_data; merging prevents one cube from clobbering another's metric.

  6. Send aggregate ResponseMsg on reverse path back to IO_CPU.
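
Steps 3 and 5 can be sketched in isolation. The KernelLaunchMsg below is a stand-in carrying only the one field the sketch needs, and stamp_target_start and max_merge are hypothetical helper names. Treating a missing metric as 0.0 is an assumption of this sketch, not a documented default.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class KernelLaunchMsg:
    """Stand-in with only the field this sketch needs (assumed shape)."""
    target_start_ns: Optional[float] = None


def stamp_target_start(msg: KernelLaunchMsg, now_ns: float,
                       leg_latencies_ns: list[float]) -> KernelLaunchMsg:
    """Pass an existing stamp through; otherwise compute a per-cube barrier."""
    if msg.target_start_ns is not None:
        return msg
    return replace(msg, target_start_ns=now_ns + max(leg_latencies_ns))


def max_merge(result_data: dict, key: str, values: list[float]) -> None:
    """Merge per-PE values into the parent without lowering an existing entry."""
    # Assumption: an absent key is treated as 0.0.
    result_data[key] = max(result_data.get(key, 0.0), max(values))


stamped = stamp_target_start(KernelLaunchMsg(), 100.0, [12.0, 30.0, 18.0])
passthrough = stamp_target_start(KernelLaunchMsg(target_start_ns=500.0), 100.0, [12.0, 30.0])

parent = {"pe_exec_ns": 900.0}                     # already written by another cube
max_merge(parent, "pe_exec_ns", [400.0, 650.0])    # existing larger value is kept
max_merge(parent, "dma_ns", [70.0, 110.0])         # new key takes the per-PE maximum
```

Because dataclasses.replace returns a new object, the original request is left untouched, which suits a message that may be shared across fan-out legs.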

D7. MMU map/unmap fan-out (_mmu_msg_fanout)

For MmuMapMsg / MmuUnmapMsg at terminal hop:

  1. _resolve_pe_ids(target_pe) → PE ids.
  2. For each PE: find path to f"{cube_prefix}.pe{pe_id}.pe_mmu" via find_node_path().
  3. Dispatch sub-Transactions with nbytes=0.
  4. PE_MMU is a terminal node — it does not send a ResponseMsg back. Instead, the sub-Transaction's own sub_done event is the completion signal.
  5. Wait for all sub_done events in-line (does not use _pending counter — D8 is for response-bearing fan-out only).
  6. Send aggregate ResponseMsg on reverse path back to IO_CPU.

D8. Response aggregation (_pending + _parent_txns)

For DMA and kernel-launch fan-out (which expect per-PE ResponseMsg arriving on the reverse path):

self._pending: dict[str, tuple[int, int, simpy.Event]] = {}
self._parent_txns: dict[str, Any] = {}

  • On dispatch: register (expected, received=0, all_done) and remember the parent transaction.

  • _worker recognises responses by is_response=True and routes them to _collect_response, which increments received and signals all_done when received >= expected.

  • After yield all_done, the fan-out path constructs the aggregate ResponseMsg:

    resp_msg = ResponseMsg(
        correlation_id=request.correlation_id,
        request_id=request.request_id,
        src_cube=cube_id,
        src_pe=-1,             # -1 = M_CPU aggregate, not a single PE
        success=True,          # no failure semantics implemented
    )
    
  • The response Transaction travels on list(reversed(txn.path)) back to IO_CPU.

MMU fan-out (D7) uses a simpler in-line list of sub_done events because PE_MMU is terminal — there is no ResponseMsg path to intercept.
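
The counting scheme for response-bearing fan-out can be sketched without SimPy. In this illustration a plain callback stands in for the all_done event, ResponseCollector is a hypothetical class, the string key is an arbitrary example, and removing the entry on completion is an assumption of the sketch.

```python
from typing import Callable


class ResponseCollector:
    """Counts per-PE responses per parent request (callback stands in for simpy.Event)."""

    def __init__(self) -> None:
        # key -> [expected, received, on_all_done]
        self._pending: dict[str, list] = {}

    def register(self, key: str, expected: int, on_all_done: Callable[[], None]) -> None:
        self._pending[key] = [expected, 0, on_all_done]

    def collect_response(self, key: str) -> None:
        entry = self._pending[key]
        entry[1] += 1
        if entry[1] >= entry[0]:
            # Assumption: drop the entry so completion is signalled exactly once.
            del self._pending[key]
            entry[2]()


done: list[str] = []
collector = ResponseCollector()
collector.register("req-7", expected=3, on_all_done=lambda: done.append("req-7"))
collector.collect_response("req-7")
collector.collect_response("req-7")
fired_early = bool(done)  # completion must not fire after only 2 of 3 responses
collector.collect_response("req-7")
```

As the Negative consequences note, nothing in this scheme times out: if the third response never arrives, the completion signal never fires.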

D9. Helpers and configurable attribute

_resolve_pe_ids(target_pe):

  • int → [target_pe]
  • tuple[int, ...] → list(target_pe)
  • "all" → range(n_slices), where n_slices comes from cube memory_map.hbm_slices_per_cube (default 8).

Used by kernel-launch and MMU fan-out paths.
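
The three cases can be sketched as a free function. This is an illustration under stated assumptions rather than the method itself: n_slices is passed as a parameter here instead of being read from the cube memory map, and the "all" case is materialized as a list.

```python
from typing import Union


def resolve_pe_ids(target_pe: Union[int, tuple, str], n_slices: int = 8) -> list[int]:
    """Normalize target_pe into a list of PE ids, following the three cases above."""
    if isinstance(target_pe, str) and target_pe == "all":
        return list(range(n_slices))
    if isinstance(target_pe, int):
        return [target_pe]
    # Remaining case: a tuple of PE ids.
    return list(target_pe)
```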

Single configurable attribute drives per-instance latency:

| Site       | impl name     | overhead_ns |
|------------|---------------|-------------|
| Cube m_cpu | builtin.m_cpu | 5.0         |

Applied once in run() per Transaction — models command interpretation and dispatch-decision time at M_CPU.

Consequences

Positive

  • Three fan-out paths are clearly separated by request type — adding a new request kind is an isinstance branch + one fan-out method.
  • M_CPU.DMA channels are independent (read and write run concurrently) and serialize only the dispatch step at capacity=1.
  • Transit-vs-terminal behavior is a single if next_hop check, so the same component handles forward dispatch and reverse response relay without role duplication.
  • target_start_ns passthrough (D6) preserves the cross-cube barrier established by IO_CPU (ADR-0036 D3), while the fallback computation keeps direct-to-M_CPU unit tests working.
  • Per-PE metric max-merge against existing parent result_data values is robust to cross-cube IO_CPU fan-out sharing the same parent.

Negative

  • No partial-failure semantics — a missing per-PE response stalls the parent all_done indefinitely. Acceptable for simulation; not suitable as a production-style endpoint.
  • _resolve_dma_destinations's cube-wide hbm_ctrl fallback is dead code (no such node exists post-ADR-0017 D4). Kept defensively; invites confusion and merits a follow-up cleanup.
  • DMA resource serialization applies only at dispatch (the put call is instantaneous in unbounded stores). The capacity=1 channel models "one dispatch at a time per channel at this M_CPU", not "transfer duration serialization"; readers must consult wire processes (ADR-0015 D2) and drain_ns for actual transfer parallelism.

References

  • ADR-0009 D3 (M_CPU fan-out and aggregation completion semantics)
  • ADR-0009 D5 (target_start_ns — passed through unchanged when present; computed as per-cube barrier when absent)
  • ADR-0011 D-VA3 (MmuMapMsg fabric path includes M_CPU as PE fan-out point)
  • ADR-0014 D4 (DMA engine capacity=1; M_CPU.DMA mirrors the same contract at cube level)
  • ADR-0015 D5 (M_CPU.DMA is internal subcomponent of M_CPU, not a topology node)
  • ADR-0017 D9 (AddressResolver returns per-PE hbm_ctrl.pe{X})
  • ADR-0036 D3 / D4 (IO_CPU stamps target_start_ns; M_CPU passes through unchanged; nbytes=0 invariant preserved through fan-out)