kernbench2/docs/adr-ko/ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md
ywkang a796c1d2f7 ADR: bilingual structure — EN canonical in adr/, KO mirror in adr-ko/
Establish English as the canonical ADR language with Korean translations
held in a parallel docs/adr-ko/ tree as derived artifacts (1:1 mirror).
Promotion from adr-proposed/ to adr/ now writes English to adr/ and the
Korean to adr-ko/; bidirectional sync rule documented in CLAUDE.md.

- Migrate 30 ADRs in docs/adr/: 28 Korean-only translated to English,
  2 bilingual pairs (ADR-0020, ADR-0023) consolidated (.en.md suffix
  dropped). ADR-0023 EN regenerated against KO source which had newer
  HW Realization Notes (D16-D23) section.
- docs/adr-history/ left frozen by design (transitional state).
- CLAUDE.md (Part 2): update ADR Lifecycle for 4-folder layout, mark
  docs/adr-ko/ as a Derived Artifact, add ADR Translation Discipline
  section covering bidirectional sync, conflict resolution (EN wins),
  and proposed-language freedom.
- tools/verify_adr_lang_pairs.py: new verification tool checking pair
  completeness, filename mirroring, ADR-ID match, Status byte-equality.
  Pre-commit hook intentionally not added; run on demand or in CI.
- tests/test_verify_adr_lang_pairs.py: 11 cases including CRLF/LF
  normalization, em-dash title separator, underscore-slug edge case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:38:44 -07:00


ADR-0035: M_CPU and M_CPU.DMA Component Model

Status

Accepted

Context

M_CPU is the cube-level command processor. It receives commands from IO_CPU (or from PCIE_EP when the engine routes Memory R/W through M_CPU as a fallback), fans them out to the PEs in its cube, and aggregates per-PE responses into a single ResponseMsg sent back to IO_CPU on the reverse path.

M_CPU.DMA is the cube-level DMA channel pair that handles Memory R/W fan-out. Per ADR-0015 D5 it is not a separate topology node — it lives as internal state of MCpuComponent.

This ADR documents the M_CPU component implementation that realizes those responsibilities, including the three distinct fan-out paths (Memory R/W, Kernel Launch, MMU Map/Unmap), the M_CPU.DMA resource model, and the response aggregation contract.

Decision

D1. Role

M_CPU has three responsibilities:

  1. Transit forwarding — when not the terminal hop (e.g., on the reverse response path PE → M_CPU → IO_CPU), forwards Transactions to next_hop in their pre-computed path.
  2. Multi-PE fan-out at terminal hop — dispatches to one of three fan-out paths based on request type (D2).
  3. Response aggregation — collects per-PE responses, sends a single aggregate ResponseMsg back to IO_CPU on the reverse path.

Each run() invocation applies overhead_ns once per incoming Transaction.

M_CPU does not:

  • Decide routing — paths are pre-computed by the router (ADR-0002).
  • Handle PE-internal execution — PE_CPU / PE_SCHEDULER / engines (ADR-0014).
  • Decode addresses — ctx.resolver.resolve(pa) returns the per-PE hbm_ctrl.pe{X} directly (ADR-0017 D9).
  • Interpret tensor or kernel semantics — fan-out dispatch by Python isinstance check only.

D2. Three fan-out paths dispatched by request type

At the terminal hop the worker dispatches by request type:

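# Excerpt from the worker; the preceding if branch is omitted here.
elif self.ctx is not None and txn.request is not None:
    if isinstance(txn.request, KernelLaunchMsg):
        # Kernel launch: fan out to each PE_CPU (D6).
        env.process(self._kernel_launch_fanout(env, txn))
    elif isinstance(txn.request, (MmuMapMsg, MmuUnmapMsg)):
        # MMU map/unmap: fan out to each PE_MMU (D7).
        env.process(self._mmu_msg_fanout(env, txn))
    else:
        # Any other request is treated as Memory R/W (D5).
        env.process(self._dma_fanout(env, txn))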

Each path uses a different router method:

  • _dma_fanout uses ctx.router.find_mcpu_dma_path() — the M_CPU-specific DMA path that avoids PE pipeline nodes.
  • _kernel_launch_fanout uses ctx.router.find_node_path() — the generic NOC command path to PE_CPU.
  • _mmu_msg_fanout uses ctx.router.find_node_path() — NOC command path to PE_MMU.
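
The dispatch rule can be sketched as a pure function that maps a request to the name of its fan-out method. This is a minimal illustration, not the MCpuComponent code: the message classes below are empty stand-ins, and select_fanout is a hypothetical helper that does not appear in the source.

```python
# Empty stand-ins for the real message types (assumption: the real classes
# carry payload fields that play no part in the dispatch decision).
class KernelLaunchMsg: pass
class MmuMapMsg: pass
class MmuUnmapMsg: pass
class MemoryWriteMsg: pass
class MemoryReadMsg: pass


def select_fanout(request: object) -> str:
    """Return the name of the fan-out method for a terminal-hop request.

    Hypothetical helper: it mirrors the isinstance chain in D2 and nothing else.
    """
    if isinstance(request, KernelLaunchMsg):
        return "_kernel_launch_fanout"
    if isinstance(request, (MmuMapMsg, MmuUnmapMsg)):
        return "_mmu_msg_fanout"
    # Everything else falls through to the Memory R/W path.
    return "_dma_fanout"


if __name__ == "__main__":
    for msg in (KernelLaunchMsg(), MmuUnmapMsg(), MemoryReadMsg()):
        print(type(msg).__name__, "->", select_fanout(msg))
```

Note that the final branch is a catch-all: any request type that is not explicitly listed is handled as Memory R/W, so a new request kind needs its own isinstance branch before the fall-through.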

D3. M_CPU.DMA internal subcomponent (ADR-0015 D5)

MCpuComponent.start() initializes two SimPy resources:

self._dma_write = simpy.Resource(env, capacity=1)  # MemoryWriteMsg
self._dma_read  = simpy.Resource(env, capacity=1)  # MemoryReadMsg

Properties:

  • Not a topology node — managed entirely inside MCpuComponent; does not appear in topology.yaml or in the compiled graph.
  • Independent read and write channels — concurrent in-flight Memory R/W is allowed.
  • Capacity=1 per channel serializes the dispatch step (yield self.out_ports[...].put(...)) of concurrent in-flight Memory R/W requests at this M_CPU. Actual fabric transfer time is modeled by wire processes between components (ADR-0015 D2) and by drain_ns at terminal hops; the DMA resource does not gate transfer duration.

Resource selection is request-type-based:

dma_res = self._dma_write if isinstance(request, MemoryWriteMsg) else self._dma_read

D4. Transit forwarding at non-terminal hops

When txn.next_hop is not None — typical for the reverse response path (PE → M_CPU → IO_CPU) — the worker forwards normally:

if next_hop:
    yield self.out_ports[next_hop].put(txn.advance())

The fan-out branches fire only at the terminal hop. The same component therefore serves both forward command dispatch and reverse response relay roles.
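
The transit-versus-terminal distinction can be illustrated with a toy Transaction. The ADR does not show the real Transaction fields, so the path-plus-hop-index layout below is an assumption, as are the node names used in the example.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Transaction:
    """Hypothetical stand-in: a pre-computed path and the current hop index."""
    path: Tuple[str, ...]
    hop: int = 0

    @property
    def next_hop(self) -> Optional[str]:
        # None means this node is the last entry in the path (terminal hop).
        nxt = self.hop + 1
        return self.path[nxt] if nxt < len(self.path) else None

    def advance(self) -> "Transaction":
        # Return a copy positioned at the following hop.
        return replace(self, hop=self.hop + 1)


def role_at_current_hop(txn: Transaction) -> str:
    """Classify the current node as a transit or terminal hop."""
    return "transit" if txn.next_hop else "terminal"


if __name__ == "__main__":
    forward = Transaction(path=("io_cpu", "m_cpu"), hop=1)
    reverse = Transaction(path=("pe_cpu", "m_cpu", "io_cpu"), hop=1)
    print(role_at_current_hop(forward))  # M_CPU is last in the path
    print(role_at_current_hop(reverse))  # M_CPU relays towards io_cpu
```

Under these assumptions the same node is terminal on the forward command path and transit on the reverse response path, which is why one component can serve both roles.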

D5. DMA fan-out (_dma_fanout — Memory R/W)

For each Memory R/W request at terminal hop:

  1. _resolve_dma_destinations(request) returns a per-PE hbm_ctrl.pe{X} derived from the request's PA via ctx.resolver.resolve(PhysAddr.decode(pa)) (ADR-0017 D9).
  2. For each destination:
    • Acquire the appropriate DMA resource (_dma_write or _dma_read) via with dma_res.request() as req.
    • Resolve path via ctx.router.find_mcpu_dma_path().
    • Compute drain_ns = ctx.compute_drain_ns(path, nbytes).
    • Create sub-Transaction carrying drain_ns and dispatch to path[1].
  3. Track max_drain_ns across destinations and record it as txn.result_data["xfer_ns"] after all responses arrive.
  4. After all per-PE responses are collected (D8), send an aggregate ResponseMsg on the reverse command path back to IO_CPU.

The PA decode fallback (f"{cube_prefix}.hbm_ctrl") is legacy dead code: no such node exists after ADR-0017 D4's per-PE partitioning. It is kept defensively but does not route to a real destination.
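
Steps 2 and 3 can be sketched as a planning loop that resolves a path per destination, computes its drain_ns, and tracks the maximum. This is a simplified illustration under stated assumptions: plan_dma_fanout is a hypothetical name, the router and drain model are passed in as plain callables, and the toy drain formula is invented for the example rather than taken from ctx.compute_drain_ns. DMA resource acquisition and the actual dispatch are left out.

```python
from typing import Callable, List, Sequence, Tuple


def plan_dma_fanout(
    destinations: Sequence[str],
    find_path: Callable[[str], List[str]],
    compute_drain_ns: Callable[[List[str], int], float],
    nbytes: int,
) -> Tuple[List[Tuple[str, float]], float]:
    """Return ([(first_hop, drain_ns), ...], max_drain_ns) for one request."""
    planned = []
    max_drain_ns = 0.0
    for dest in destinations:
        path = find_path(dest)
        drain_ns = compute_drain_ns(path, nbytes)
        # path[0] is this M_CPU; the sub-Transaction is dispatched to path[1].
        planned.append((path[1], drain_ns))
        max_drain_ns = max(max_drain_ns, drain_ns)
    return planned, max_drain_ns


# Toy topology and drain model (assumptions made only for this example).
TOY_PATHS = {
    "hbm_ctrl.pe0": ["m_cpu", "noc0", "hbm_ctrl.pe0"],
    "hbm_ctrl.pe1": ["m_cpu", "noc0", "noc1", "hbm_ctrl.pe1"],
}


def toy_drain_ns(path: List[str], nbytes: int) -> float:
    return (len(path) - 1) * nbytes / 100


if __name__ == "__main__":
    planned, xfer_ns = plan_dma_fanout(
        list(TOY_PATHS), TOY_PATHS.__getitem__, toy_drain_ns, nbytes=200
    )
    result_data = {"xfer_ns": xfer_ns}
    print(planned, result_data)
```

Taking the maximum rather than the sum reflects that the per-destination transfers proceed in parallel, so the slowest one determines the reported transfer time.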

D6. Kernel launch fan-out (_kernel_launch_fanout)

For KernelLaunchMsg at terminal hop:

  1. _resolve_pe_ids(target_pe) → list of PE ids in this cube.

  2. For each PE: find path to f"{cube_prefix}.pe{pe_id}.pe_cpu" via ctx.router.find_node_path().

  3. target_start_ns handling (ADR-0009 D5):

    • If the request already carries target_start_ns (stamped by IO_CPU per ADR-0036 D3): pass through unchanged.
    • If absent (direct-to-M_CPU launch in unit tests): compute a per-cube barrier env.now + max(per-PE leg latency) and stamp via dataclasses.replace.
  4. Dispatch sub-Transactions with nbytes=0 (kernel launch is a control message; preserving nbytes=0 keeps fan-out off the shared first-hop fabric BW, mirroring ADR-0036 D4).

  5. After all per-PE responses arrive (D8), aggregate per-PE metrics from each sub-Transaction's result_data into the parent transaction:

    txn.result_data["pe_exec_ns"]  = max(existing, max(pe_exec_values))
    txn.result_data["dma_ns"]      = max(existing, max(dma_values))
    txn.result_data["compute_ns"]  = max(existing, max(compute_values))
    

    The max-merge with the existing value matters because cross-cube IO_CPU fan-out shares the same parent result_data; merging prevents one cube from clobbering another's metric.

  6. Send aggregate ResponseMsg on reverse path back to IO_CPU.
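
The target_start_ns handling in step 3 and the max-merge in step 5 can be sketched together. This is an illustration under assumptions: the KernelLaunchMsg stand-in holds only the one field needed here, stamp_target_start and max_merge are hypothetical helper names, and a missing metric is treated as 0.0 because the ADR does not state the initial value.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class KernelLaunchMsg:
    """Stand-in with only the field this sketch needs."""
    target_start_ns: Optional[float] = None


def stamp_target_start(
    request: KernelLaunchMsg, now_ns: float, leg_latencies_ns: Sequence[float]
) -> KernelLaunchMsg:
    """Pass a stamped request through; otherwise stamp a per-cube barrier."""
    if request.target_start_ns is not None:
        return request  # already stamped upstream: leave unchanged
    barrier = now_ns + max(leg_latencies_ns)
    return replace(request, target_start_ns=barrier)


def max_merge(result_data: Dict[str, float], key: str, values: Sequence[float]) -> None:
    """Merge per-PE values into the parent without lowering an existing value."""
    if not values:
        return
    existing = result_data.get(key, 0.0)  # assumption: absent means 0.0
    result_data[key] = max(existing, max(values))


if __name__ == "__main__":
    print(stamp_target_start(KernelLaunchMsg(), 100.0, [3.0, 7.0]))
    parent = {"pe_exec_ns": 90.0}  # value written earlier by another cube
    max_merge(parent, "pe_exec_ns", [40.0, 60.0])
    max_merge(parent, "dma_ns", [5.0, 9.0])
    print(parent)
```

In the usage above, the second cube reports a smaller pe_exec_ns than the value already in the shared parent, and the merge keeps the larger one instead of overwriting it.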

D7. MMU map/unmap fan-out (_mmu_msg_fanout)

For MmuMapMsg / MmuUnmapMsg at terminal hop:

  1. _resolve_pe_ids(target_pe) → PE ids.
  2. For each PE: find path to f"{cube_prefix}.pe{pe_id}.pe_mmu" via find_node_path().
  3. Dispatch sub-Transactions with nbytes=0.
  4. PE_MMU is a terminal node — it does not send a ResponseMsg back. Instead, the sub-Transaction's own sub_done event is the completion signal.
  5. Wait for all sub_done events in-line (does not use _pending counter — D8 is for response-bearing fan-out only).
  6. Send aggregate ResponseMsg on reverse path back to IO_CPU.
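
The per-PE target names used by the kernel-launch and MMU fan-out paths follow one naming pattern, sketched below. fanout_targets is a hypothetical helper, and the "cube0" prefix in the example is an assumed value; the ADR does not show what cube_prefix looks like.

```python
from typing import Iterable, List


def fanout_targets(cube_prefix: str, pe_ids: Iterable[int], leaf: str) -> List[str]:
    """Build per-PE node names of the form '<cube_prefix>.pe<id>.<leaf>'."""
    return [f"{cube_prefix}.pe{pe_id}.{leaf}" for pe_id in pe_ids]


if __name__ == "__main__":
    # "cube0" is a made-up prefix used only for illustration.
    print(fanout_targets("cube0", [0, 2], "pe_mmu"))
    print(fanout_targets("cube0", [1], "pe_cpu"))
```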

D8. Response aggregation (_pending + _parent_txns)

For DMA and kernel-launch fan-out (which expect per-PE ResponseMsg arriving on the reverse path):

self._pending: dict[str, tuple[int, int, simpy.Event]] = {}
self._parent_txns: dict[str, Any] = {}

  • On dispatch: register (expected, received=0, all_done) and remember the parent transaction.

  • _worker recognises responses by is_response=True and routes them to _collect_response, which increments received and signals all_done when received >= expected.

  • After yield all_done, the fan-out path constructs the aggregate ResponseMsg:

    resp_msg = ResponseMsg(
        correlation_id=request.correlation_id,
        request_id=request.request_id,
        src_cube=cube_id,
        src_pe=-1,             # -1 = M_CPU aggregate, not a single PE
        success=True,          # no failure semantics implemented
    )
    
  • The response Transaction travels on list(reversed(txn.path)) back to IO_CPU.

MMU fan-out (D7) uses a simpler in-line list of sub_done events because PE_MMU is terminal — there is no ResponseMsg path to intercept.
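
The counting logic behind _pending and _parent_txns can be sketched without SimPy. This is a simplified model, not the component code: PendingTracker is a hypothetical class, the simpy.Event is replaced by a plain boolean, and the dictionary key is left generic because the ADR does not say which string identifies a fan-out.

```python
from typing import Any, Dict, List


class PendingTracker:
    """Toy model of the (expected, received, all_done) bookkeeping in D8."""

    def __init__(self) -> None:
        # key -> [expected, received, all_done]; a list so it can be updated.
        self._pending: Dict[str, List[Any]] = {}
        self._parent_txns: Dict[str, Any] = {}

    def register(self, key: str, expected: int, parent_txn: Any) -> None:
        """Called on dispatch, before any sub-Transaction response arrives."""
        self._pending[key] = [expected, 0, False]
        self._parent_txns[key] = parent_txn

    def collect_response(self, key: str) -> bool:
        """Count one per-PE response; return True once all have arrived."""
        entry = self._pending[key]
        entry[1] += 1
        if entry[1] >= entry[0]:
            entry[2] = True  # stands in for signalling the all_done event
        return entry[2]

    def finish(self, key: str) -> Any:
        """Drop the bookkeeping and hand back the parent transaction."""
        del self._pending[key]
        return self._parent_txns.pop(key)


if __name__ == "__main__":
    tracker = PendingTracker()
    tracker.register("req-1", expected=3, parent_txn={"name": "parent"})
    print([tracker.collect_response("req-1") for _ in range(3)])
    print(tracker.finish("req-1"))
```

The sketch also makes the stall risk visible: if one of the expected responses never arrives, collect_response never returns True and the parent is never released.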

D9. Helpers and configurable attribute

_resolve_pe_ids(target_pe):

  • int → [target_pe]
  • tuple[int, ...] → list(target_pe)
  • "all" → range(n_slices), where n_slices comes from cube memory_map.hbm_slices_per_cube (default 8).

Used by kernel-launch and MMU fan-out paths.
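
The three accepted forms of target_pe can be sketched as a standalone function. This is an approximation of the helper, not its source: n_slices is passed as a parameter instead of being read from the cube memory_map, the "all" case returns a list rather than a range for easier comparison, and the TypeError for other inputs is an assumption about error handling.

```python
from typing import List, Tuple, Union

TargetPe = Union[int, Tuple[int, ...], str]


def resolve_pe_ids(target_pe: TargetPe, n_slices: int = 8) -> List[int]:
    """Expand a target_pe value into an explicit list of PE ids."""
    if isinstance(target_pe, str):
        if target_pe == "all":
            return list(range(n_slices))
        raise TypeError(f"unsupported target_pe string: {target_pe!r}")
    if isinstance(target_pe, int):
        return [target_pe]
    if isinstance(target_pe, tuple):
        return list(target_pe)
    # Assumption: anything else is rejected; the ADR does not specify this.
    raise TypeError(f"unsupported target_pe: {target_pe!r}")


if __name__ == "__main__":
    print(resolve_pe_ids(3))
    print(resolve_pe_ids((0, 2, 5)))
    print(resolve_pe_ids("all", n_slices=4))
```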

Single configurable attribute drives per-instance latency:

Site | impl name | overhead_ns
Cube m_cpu | builtin.m_cpu | 5.0

Applied once in run() per Transaction — models command interpretation and dispatch-decision time at M_CPU.

Consequences

Positive

  • Three fan-out paths are clearly separated by request type — adding a new request kind is an isinstance branch + one fan-out method.
  • M_CPU.DMA channels are independent (read and write run concurrently) and serialize only the dispatch step at capacity=1.
  • Transit-vs-terminal behavior is a single if next_hop check, so the same component handles forward dispatch and reverse response relay without role duplication.
  • target_start_ns passthrough (D6) preserves the cross-cube barrier established by IO_CPU (ADR-0036 D3), while the fallback computation keeps direct-to-M_CPU unit tests working.
  • Per-PE metric max-merge against existing parent result_data values is robust to cross-cube IO_CPU fan-out sharing the same parent.

Negative

  • No partial-failure semantics — a missing per-PE response stalls the parent all_done indefinitely. Acceptable for simulation; not suitable as a production-style endpoint.
  • _resolve_dma_destinations's cube-wide hbm_ctrl fallback is dead code (no such node exists post-ADR-0017 D4). Kept defensively; invites confusion and merits a follow-up cleanup.
  • DMA resource serialization applies only at dispatch (the put call is instantaneous in unbounded stores). The capacity=1 channel models "one request being dispatched at a time per channel at this M_CPU", not "transfer duration serialization"; readers must consult wire processes (ADR-0015 D2) and drain_ns for actual transfer parallelism.

References

  • ADR-0009 D3 (M_CPU fan-out and aggregation completion semantics)
  • ADR-0009 D5 (target_start_ns — passed through unchanged when present; computed as per-cube barrier when absent)
  • ADR-0011 D-VA3 (MmuMapMsg fabric path includes M_CPU as PE fan-out point)
  • ADR-0014 D4 (DMA engine capacity=1; M_CPU.DMA mirrors the same contract at cube level)
  • ADR-0015 D5 (M_CPU.DMA is internal subcomponent of M_CPU, not a topology node)
  • ADR-0017 D9 (AddressResolver returns per-PE hbm_ctrl.pe{X})
  • ADR-0036 D3 / D4 (IO_CPU stamps target_start_ns; M_CPU passes through unchanged; nbytes=0 invariant preserved through fan-out)