ADR-0035: M_CPU and M_CPU.DMA Component Model
Status
Accepted
Context
M_CPU is the cube-level command processor. It receives commands from IO_CPU (or from PCIE_EP when the engine routes Memory R/W through M_CPU as a fallback), fans them out to the PEs in its cube, and aggregates per-PE responses into a single ResponseMsg sent back to IO_CPU on the reverse path.
M_CPU.DMA is the cube-level DMA channel pair that handles Memory R/W
fan-out. Per ADR-0015 D5 it is not a separate topology node —
it lives as internal state of MCpuComponent.
This ADR documents the M_CPU component implementation that realizes those responsibilities, including the three distinct fan-out paths (Memory R/W, Kernel Launch, MMU Map/Unmap), the M_CPU.DMA resource model, and the response aggregation contract.
Decision
D1. Role
M_CPU has three responsibilities:
- Transit forwarding: when not the terminal hop (e.g., on the reverse response path PE → M_CPU → IO_CPU), forwards Transactions to `next_hop` in their pre-computed path.
- Multi-PE fan-out at terminal hop: dispatches to one of three fan-out paths based on request type (D2).
- Response aggregation: collects per-PE responses and sends a single aggregate ResponseMsg back to IO_CPU on the reverse path.
Per invocation (run()): applies overhead_ns once per incoming
Transaction.
M_CPU does not:
- Decide routing: paths are pre-computed by the router (ADR-0002).
- Handle PE-internal execution: that belongs to PE_CPU / PE_SCHEDULER / engines (ADR-0014).
- Decode addresses: `ctx.resolver.resolve(pa)` returns the per-PE `hbm_ctrl.pe{X}` directly (ADR-0017 D9).
- Interpret tensor or kernel semantics: fan-out dispatch is by Python `isinstance` check only.
D2. Three fan-out paths dispatched by request type
At the terminal hop the worker dispatches by request type:
    elif self.ctx is not None and txn.request is not None:
        if isinstance(txn.request, KernelLaunchMsg):
            env.process(self._kernel_launch_fanout(env, txn))
        elif isinstance(txn.request, (MmuMapMsg, MmuUnmapMsg)):
            env.process(self._mmu_msg_fanout(env, txn))
        else:
            env.process(self._dma_fanout(env, txn))
Each path uses a different router method:
- `_dma_fanout` uses `ctx.router.find_mcpu_dma_path()`: the M_CPU-specific DMA path that avoids PE pipeline nodes.
- `_kernel_launch_fanout` uses `ctx.router.find_node_path()`: the generic NOC command path to PE_CPU.
- `_mmu_msg_fanout` uses `ctx.router.find_node_path()`: the NOC command path to PE_MMU.
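The mapping from request type to fan-out method and router method can be sketched in isolation. This is a minimal illustration, not the component code: the message classes are empty stand-ins for the real ones, and `select_fanout` is a hypothetical helper that returns names instead of spawning SimPy processes.

```python
from dataclasses import dataclass


# Empty stand-ins for the real message classes (fields omitted on purpose).
@dataclass(frozen=True)
class KernelLaunchMsg:
    """Stand-in for a kernel launch request."""


@dataclass(frozen=True)
class MmuMapMsg:
    """Stand-in for an MMU map request."""


@dataclass(frozen=True)
class MmuUnmapMsg:
    """Stand-in for an MMU unmap request."""


@dataclass(frozen=True)
class MemoryWriteMsg:
    """Stand-in for a memory write request."""


def select_fanout(request: object) -> tuple[str, str]:
    """Return (fan-out method name, router method name) for a terminal-hop request.

    Hypothetical helper: mirrors the isinstance ladder in D2 without SimPy.
    """
    if isinstance(request, KernelLaunchMsg):
        return ("_kernel_launch_fanout", "find_node_path")
    if isinstance(request, (MmuMapMsg, MmuUnmapMsg)):
        return ("_mmu_msg_fanout", "find_node_path")
    # Any other request type falls through to the DMA path, as in the else branch.
    return ("_dma_fanout", "find_mcpu_dma_path")


if __name__ == "__main__":
    for req in (KernelLaunchMsg(), MmuUnmapMsg(), MemoryWriteMsg()):
        print(type(req).__name__, "->", select_fanout(req))
```

Because the final branch is a catch-all, any request type without its own branch is treated as Memory R/W; a new request kind therefore needs an explicit `isinstance` branch ahead of it.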
D3. M_CPU.DMA internal subcomponent (ADR-0015 D5)
MCpuComponent.start() initializes two SimPy resources:
self._dma_write = simpy.Resource(env, capacity=1) # MemoryWriteMsg
self._dma_read = simpy.Resource(env, capacity=1) # MemoryReadMsg
Properties:
- Not a topology node: managed entirely inside `MCpuComponent`; does not appear in `topology.yaml` or in the compiled graph.
- Independent read and write channels: concurrent in-flight Memory R/W is allowed.
- Capacity=1 per channel serializes the dispatch step (`yield self.out_ports[...].put(...)`) of concurrent in-flight Memory R/W requests at this M_CPU. Actual fabric transfer time is modeled by wire processes between components (ADR-0015 D2) and by `drain_ns` at terminal hops; the DMA resource does not gate transfer duration.
Resource selection is request-type-based:
dma_res = self._dma_write if isinstance(request, MemoryWriteMsg) else self._dma_read
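The effect of two independent capacity=1 channels can be illustrated without SimPy. The sketch below is a deterministic stand-in, not the component code: `dispatch_start_times` and its `dispatch_ns` parameter are hypothetical, and the nonzero hold time exists only to make the ordering visible (in the model described here the `put` itself is instantaneous).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryWriteMsg:
    """Stand-in for the real write request."""
    nbytes: int


@dataclass(frozen=True)
class MemoryReadMsg:
    """Stand-in for the real read request."""
    nbytes: int


def dispatch_start_times(arrivals, dispatch_ns):
    """Compute when each request begins its dispatch step.

    arrivals: list of (arrival_ns, request), sorted by arrival_ns.
    dispatch_ns: hypothetical time one dispatch step holds its channel.
    Each channel admits one dispatch at a time; the two channels do not
    block each other.
    """
    free_at = {"write": 0.0, "read": 0.0}  # next free time per channel
    starts = []
    for arrival_ns, request in arrivals:
        # Same selection rule as D3: writes use one channel, all else the other.
        channel = "write" if isinstance(request, MemoryWriteMsg) else "read"
        start = max(arrival_ns, free_at[channel])
        free_at[channel] = start + dispatch_ns
        starts.append(start)
    return starts


if __name__ == "__main__":
    reqs = [
        (0.0, MemoryWriteMsg(64)),
        (0.0, MemoryWriteMsg(64)),  # waits behind the first write
        (0.0, MemoryReadMsg(64)),   # read channel is free, starts at once
    ]
    print(dispatch_start_times(reqs, dispatch_ns=1.0))  # [0.0, 1.0, 0.0]
```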
D4. Transit forwarding at non-terminal hops
When txn.next_hop is not None — typical for the reverse response
path (PE → M_CPU → IO_CPU) — the worker forwards normally:
    if next_hop:
        yield self.out_ports[next_hop].put(txn.advance())
The fan-out branches fire only at the terminal hop. The same component therefore serves both forward command dispatch and reverse response relay roles.
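A minimal sketch of the transit-versus-terminal distinction follows. The `Transaction` class here is a hypothetical reduction that tracks only a path and a hop index; the real Transaction carries more state, and the node names are illustrative.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Transaction:
    """Hypothetical reduced Transaction: a pre-computed path plus a position."""
    path: tuple[str, ...]
    hop: int = 0  # index of the node currently holding the transaction

    @property
    def next_hop(self) -> Optional[str]:
        nxt = self.hop + 1
        return self.path[nxt] if nxt < len(self.path) else None

    def advance(self) -> "Transaction":
        """Return a copy positioned at the next node on the path."""
        return replace(self, hop=self.hop + 1)


def role_at(txn: Transaction) -> str:
    """Classify the current node's role for this transaction."""
    return "transit" if txn.next_hop is not None else "terminal"


if __name__ == "__main__":
    # Forward command: M_CPU is the last node, so fan-out fires there.
    forward = Transaction(path=("io_cpu", "m_cpu")).advance()
    print(role_at(forward))  # terminal

    # Reverse response: M_CPU sits mid-path and only relays.
    reverse = Transaction(path=("pe_cpu", "m_cpu", "io_cpu")).advance()
    print(role_at(reverse), reverse.next_hop)  # transit io_cpu
```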
D5. DMA fan-out (_dma_fanout — Memory R/W)
For each Memory R/W request at terminal hop:
- `_resolve_dma_destinations(request)` returns a per-PE `hbm_ctrl.pe{X}` derived from the request's PA via `ctx.resolver.resolve(PhysAddr.decode(pa))` (ADR-0017 D9).
- For each destination:
  - Acquire the appropriate DMA resource (`_dma_write` or `_dma_read`) via `with dma_res.request() as req`.
  - Resolve path via `ctx.router.find_mcpu_dma_path()`.
  - Compute `drain_ns = ctx.compute_drain_ns(path, nbytes)`.
  - Create sub-Transaction carrying `drain_ns` and dispatch to `path[1]`.
- Track `max_drain_ns` across destinations and record it as `txn.result_data["xfer_ns"]` after all responses arrive.
- After all per-PE responses are collected (D8), send an aggregate ResponseMsg on the reverse command path back to IO_CPU.
PA decode fallback (f"{cube_prefix}.hbm_ctrl") is legacy dead code —
no such node exists after ADR-0017 D4's per-PE partitioning. Kept
defensively but does not route to a real destination.
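The `max_drain_ns` bookkeeping can be sketched as below. This ADR does not give the formula behind `ctx.compute_drain_ns(path, nbytes)`, so the sketch assumes a simple bytes-over-bottleneck-bandwidth model; `aggregate_xfer_ns`, the bandwidth figures, and the destination names are all hypothetical.

```python
def aggregate_xfer_ns(bottleneck_bytes_per_ns, nbytes):
    """Return the result_data fragment recorded after a DMA fan-out.

    bottleneck_bytes_per_ns: dict mapping destination node name to the
        assumed bottleneck bandwidth (bytes per ns) along its path.
    nbytes: payload size sent to each destination.
    """
    max_drain_ns = 0.0
    for _dst, bw in bottleneck_bytes_per_ns.items():
        # Assumed drain model; the real compute_drain_ns may differ.
        drain_ns = nbytes / bw
        max_drain_ns = max(max_drain_ns, drain_ns)
    # The slowest destination determines the reported transfer time.
    return {"xfer_ns": max_drain_ns}


if __name__ == "__main__":
    paths = {"hbm_ctrl.pe0": 64.0, "hbm_ctrl.pe1": 32.0}
    print(aggregate_xfer_ns(paths, nbytes=4096))  # {'xfer_ns': 128.0}
```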
D6. Kernel launch fan-out (_kernel_launch_fanout)
For KernelLaunchMsg at terminal hop:
- `_resolve_pe_ids(target_pe)` → list of PE ids in this cube.
- For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_cpu"` via `ctx.router.find_node_path()`.
- `target_start_ns` handling (ADR-0009 D5):
  - If the request already carries `target_start_ns` (stamped by IO_CPU per ADR-0036 D3): pass through unchanged.
  - If absent (direct-to-M_CPU launch in unit tests): compute a per-cube barrier `env.now + max(per-PE leg latency)` and stamp via `dataclasses.replace`.
- Dispatch sub-Transactions with `nbytes=0` (kernel launch is a control message; preserving nbytes=0 keeps fan-out off the shared first-hop fabric BW, mirroring ADR-0036 D4).
- After all per-PE responses arrive (D8), aggregate per-PE metrics from each sub-Transaction's `result_data` into the parent transaction:

      txn.result_data["pe_exec_ns"] = max(existing, max(pe_exec_values))
      txn.result_data["dma_ns"] = max(existing, max(dma_values))
      txn.result_data["compute_ns"] = max(existing, max(compute_values))

  The max-merge with the existing value matters because cross-cube IO_CPU fan-out shares the same parent `result_data`; merging prevents one cube from clobbering another's metric.
- Send aggregate ResponseMsg on reverse path back to IO_CPU.
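The two pieces of logic that are easiest to get wrong in this path, the `target_start_ns` passthrough-or-stamp rule and the max-merge into shared `result_data`, can be sketched as plain functions. The `KernelLaunchMsg` below is a reduced stand-in, `stamp_target_start` and `merge_max` are hypothetical helper names, and the 0.0 default for a missing metric is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class KernelLaunchMsg:
    """Reduced stand-in: only the field relevant to the barrier rule."""
    target_start_ns: Optional[float] = None


def stamp_target_start(request, now_ns, leg_latencies_ns):
    """Pass an existing barrier through; otherwise compute a per-cube one."""
    if request.target_start_ns is not None:
        return request  # already stamped upstream, leave untouched
    barrier = now_ns + max(leg_latencies_ns)
    return replace(request, target_start_ns=barrier)


def merge_max(result_data, key, values):
    """Max-merge per-PE values into the parent without lowering the stored value.

    Assumes `values` is non-empty (at least one PE responded) and that a
    missing key counts as 0.0.
    """
    existing = result_data.get(key, 0.0)
    result_data[key] = max(existing, max(values))


if __name__ == "__main__":
    fresh = stamp_target_start(KernelLaunchMsg(), 100.0, [3.0, 7.0, 5.0])
    print(fresh.target_start_ns)  # 107.0

    shared = {"pe_exec_ns": 40.0}             # written earlier by another cube
    merge_max(shared, "pe_exec_ns", [10.0, 25.0])
    print(shared["pe_exec_ns"])               # 40.0, the earlier value survives
```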
D7. MMU map/unmap fan-out (_mmu_msg_fanout)
For MmuMapMsg / MmuUnmapMsg at terminal hop:
- `_resolve_pe_ids(target_pe)` → PE ids.
- For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_mmu"` via `find_node_path()`.
- Dispatch sub-Transactions with `nbytes=0`.
- PE_MMU is a terminal node: it does not send a ResponseMsg back. Instead, the sub-Transaction's own `sub_done` event is the completion signal.
- Wait for all `sub_done` events in-line (does not use the `_pending` counter; D8 is for response-bearing fan-out only).
- Send aggregate ResponseMsg on reverse path back to IO_CPU.
D8. Response aggregation (_pending + _parent_txns)
For DMA and kernel-launch fan-out (which expect per-PE ResponseMsg arriving on the reverse path):
self._pending: dict[str, tuple[int, int, simpy.Event]] = {}
self._parent_txns: dict[str, Any] = {}
- On dispatch: register `(expected, received=0, all_done)` and remember the parent transaction.
- `_worker` recognises responses by `is_response=True` and routes them to `_collect_response`, which increments `received` and signals `all_done` when `received >= expected`.
- After `yield all_done`, the fan-out path constructs the aggregate ResponseMsg:

      resp_msg = ResponseMsg(
          correlation_id=request.correlation_id,
          request_id=request.request_id,
          src_cube=cube_id,
          src_pe=-1,     # -1 = M_CPU aggregate, not a single PE
          success=True,  # no failure semantics implemented
      )

- The response Transaction travels on `list(reversed(txn.path))` back to IO_CPU.
MMU fan-out (D7) uses a simpler in-line list of sub_done events
because PE_MMU is terminal — there is no ResponseMsg path to
intercept.
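The counting contract behind `_pending` can be sketched without SimPy. In the sketch below `FakeEvent` is a stand-in that mimics only the `succeed()` method and `triggered` attribute of `simpy.Event`; `ResponseCollector`, the string transaction ids, and the removal of the entry on completion are assumptions made for illustration.

```python
class FakeEvent:
    """Stand-in for simpy.Event: records whether it has been signalled."""

    def __init__(self):
        self.triggered = False

    def succeed(self):
        self.triggered = True


class ResponseCollector:
    """Tracks (expected, received, all_done) per parent transaction id."""

    def __init__(self):
        self._pending: dict[str, tuple[int, int, FakeEvent]] = {}

    def register(self, txn_id: str, expected: int) -> FakeEvent:
        """Called at dispatch time, before any response can arrive."""
        all_done = FakeEvent()
        self._pending[txn_id] = (expected, 0, all_done)
        return all_done

    def collect_response(self, txn_id: str) -> None:
        """Called once per per-PE response seen on the reverse path."""
        expected, received, all_done = self._pending[txn_id]
        received += 1
        if received >= expected:
            del self._pending[txn_id]  # assumed cleanup, not stated in the ADR
            all_done.succeed()
        else:
            self._pending[txn_id] = (expected, received, all_done)


if __name__ == "__main__":
    collector = ResponseCollector()
    done = collector.register("txn-1", expected=3)
    for _ in range(3):
        collector.collect_response("txn-1")
    print(done.triggered)  # True
```

If one of the expected responses never arrives, `all_done` is never signalled and the waiting fan-out path blocks, which is the stall noted under Consequences.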
D9. Helpers and configurable attribute
_resolve_pe_ids(target_pe):
- `int` → `[target_pe]`
- `tuple[int, ...]` → `list(target_pe)`
- `"all"` → `range(n_slices)` where `n_slices` comes from cube `memory_map.hbm_slices_per_cube` (default 8).
Used by kernel-launch and MMU fan-out paths.
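The three accepted forms of `target_pe` can be sketched as a standalone function. This is an illustration under assumptions: the real helper is a method that reads `n_slices` from the cube's memory map, whereas here it is a parameter, and raising `ValueError` for unrecognised input is a choice made for the sketch.

```python
from typing import Union


def resolve_pe_ids(target_pe: Union[int, tuple, str], n_slices: int = 8) -> list[int]:
    """Expand a target_pe selector into a concrete list of PE ids."""
    if isinstance(target_pe, int):
        return [target_pe]
    if isinstance(target_pe, tuple):
        return list(target_pe)
    if target_pe == "all":
        # One PE per HBM slice in the cube.
        return list(range(n_slices))
    # Assumed behaviour for anything else; the ADR does not specify it.
    raise ValueError(f"unsupported target_pe: {target_pe!r}")


if __name__ == "__main__":
    print(resolve_pe_ids(3))                  # [3]
    print(resolve_pe_ids((1, 4)))             # [1, 4]
    print(resolve_pe_ids("all", n_slices=4))  # [0, 1, 2, 3]
```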
Single configurable attribute drives per-instance latency:
| Site | impl name | overhead_ns |
|---|---|---|
| Cube `m_cpu` | `builtin.m_cpu` | 5.0 |
Applied once in run() per Transaction — models command
interpretation and dispatch-decision time at M_CPU.
Consequences
Positive
- Three fan-out paths are clearly separated by request type — adding a new request kind is an isinstance branch + one fan-out method.
- M_CPU.DMA channels are independent (read and write run concurrently) and serialize only the dispatch step at capacity=1.
- Transit-vs-terminal behavior is a single `if next_hop` check, so the same component handles forward dispatch and reverse response relay without role duplication.
- `target_start_ns` passthrough (D6) preserves the cross-cube barrier established by IO_CPU (ADR-0036 D3), while the fallback computation keeps direct-to-M_CPU unit tests working.
- Per-PE metric max-merge against existing parent `result_data` values is robust to cross-cube IO_CPU fan-out sharing the same parent.
Negative
- No partial-failure semantics: a missing per-PE response stalls the parent `all_done` indefinitely. Acceptable for simulation; not suitable as a production-style endpoint.
- `_resolve_dma_destinations`'s cube-wide hbm_ctrl fallback is dead code (no such node exists post-ADR-0017 D4). Kept defensively; invites confusion and merits a follow-up cleanup.
- DMA resource serialization applies only at dispatch (the `put` call is instantaneous in unbounded stores). The capacity=1 channel models "one request in flight at a time at this M_CPU", not "transfer duration serialization"; readers must consult wire processes (ADR-0015 D2) and `drain_ns` for actual transfer parallelism.
Links
- ADR-0009 D3 (M_CPU fan-out and aggregation completion semantics)
- ADR-0009 D5 (`target_start_ns`: passed through unchanged when present; computed as per-cube barrier when absent)
- ADR-0011 D-VA3 (MmuMapMsg fabric path includes M_CPU as PE fan-out point)
- ADR-0014 D4 (DMA engine capacity=1; M_CPU.DMA mirrors the same contract at cube level)
- ADR-0015 D5 (M_CPU.DMA is internal subcomponent of M_CPU, not a topology node)
- ADR-0017 D9 (AddressResolver returns per-PE `hbm_ctrl.pe{X}`)
- ADR-0036 D3 / D4 (IO_CPU stamps `target_start_ns`; M_CPU passes through unchanged; nbytes=0 invariant preserved through fan-out)