Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
(dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
retroactive docs pending verification.
Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
deleted; ADR-0019/0021 moved to adr-history with one-line stub status
Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
selection, flit-aware per-flit commit, async finalize, command-only
fallback path)
Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
"Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
(now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)
Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py
Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.
Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
(ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ADR-0035: M_CPU and M_CPU.DMA Component Model
Status
Accepted
Context
M_CPU is the cube-level command processor. It receives commands from IO_CPU (or from PCIE_EP when the engine routes Memory R/W through M_CPU as a fallback), fans them out to the PEs in its cube, and aggregates per-PE responses into a single ResponseMsg sent back to IO_CPU on the reverse path.
M_CPU.DMA is the cube-level DMA channel pair that handles Memory R/W
fan-out. Per ADR-0015 D5 it is not a separate topology node —
it lives as internal state of MCpuComponent.
This ADR documents the M_CPU component implementation that realizes those responsibilities, including the three distinct fan-out paths (Memory R/W, Kernel Launch, MMU Map/Unmap), the M_CPU.DMA resource model, and the response aggregation contract.
Decision
D1. Role
M_CPU has three responsibilities:
- Transit forwarding: when not the terminal hop (e.g., on the reverse response path PE → M_CPU → IO_CPU), forwards Transactions to next_hop in their pre-computed path.
- Multi-PE fan-out at terminal hop: dispatches to one of three fan-out paths based on request type (D2).
- Response aggregation: collects per-PE responses and sends a single aggregate ResponseMsg back to IO_CPU on the reverse path.
Per invocation (run()): applies overhead_ns once per incoming
Transaction.
M_CPU does not:
- Decide routing — paths are pre-computed by the router (ADR-0002).
- Handle PE-internal execution — PE_CPU / PE_SCHEDULER / engines (ADR-0014).
- Decode addresses: ctx.resolver.resolve(pa) returns the per-PE hbm_ctrl.pe{X} directly (ADR-0017 D9).
- Interpret tensor or kernel semantics: fan-out dispatch is by Python isinstance check only.
D2. Three fan-out paths dispatched by request type
At the terminal hop the worker dispatches by request type:
    elif self.ctx is not None and txn.request is not None:
        if isinstance(txn.request, KernelLaunchMsg):
            env.process(self._kernel_launch_fanout(env, txn))
        elif isinstance(txn.request, (MmuMapMsg, MmuUnmapMsg)):
            env.process(self._mmu_msg_fanout(env, txn))
        else:
            env.process(self._dma_fanout(env, txn))
Each path uses a different router method:
- _dma_fanout uses ctx.router.find_mcpu_dma_path(), the M_CPU-specific DMA path that avoids PE pipeline nodes.
- _kernel_launch_fanout uses ctx.router.find_node_path(), the generic NOC command path to PE_CPU.
- _mmu_msg_fanout uses ctx.router.find_node_path(), the NOC command path to PE_MMU.
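The dispatch rule above can be sketched in isolation. This is a minimal illustration, not the component's code: the message classes are empty stand-ins for the real types, and `select_fanout` is a hypothetical helper that returns the name of the chosen fan-out method instead of starting a SimPy process.

```python
# Empty stand-ins for the real message types; their fields are omitted on purpose.
class KernelLaunchMsg: pass
class MmuMapMsg: pass
class MmuUnmapMsg: pass
class MemoryWriteMsg: pass
class MemoryReadMsg: pass


def select_fanout(request: object) -> str:
    """Hypothetical helper: name the fan-out path a terminal-hop request would take."""
    if isinstance(request, KernelLaunchMsg):
        return "_kernel_launch_fanout"
    if isinstance(request, (MmuMapMsg, MmuUnmapMsg)):
        return "_mmu_msg_fanout"
    # Everything else falls through to the Memory R/W path, as in the else branch of D2.
    return "_dma_fanout"
```

Because the Memory R/W path is the fall-through branch, any request type that is not explicitly matched would also land in `_dma_fanout` under this rule.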
D3. M_CPU.DMA internal subcomponent (ADR-0015 D5)
MCpuComponent.start() initializes two SimPy resources:
self._dma_write = simpy.Resource(env, capacity=1) # MemoryWriteMsg
self._dma_read = simpy.Resource(env, capacity=1) # MemoryReadMsg
Properties:
- Not a topology node: managed entirely inside MCpuComponent; does not appear in topology.yaml or in the compiled graph.
- Independent read and write channels: concurrent in-flight Memory R/W is allowed.
- Capacity=1 per channel serializes the dispatch step (yield self.out_ports[...].put(...)) of concurrent in-flight Memory R/W requests at this M_CPU. Actual fabric transfer time is modeled by wire processes between components (ADR-0015 D2) and by drain_ns at terminal hops; the DMA resource does not gate transfer duration.
Resource selection is request-type-based:
dma_res = self._dma_write if isinstance(request, MemoryWriteMsg) else self._dma_read
D4. Transit forwarding at non-terminal hops
When txn.next_hop is not None — typical for the reverse response
path (PE → M_CPU → IO_CPU) — the worker forwards normally:
    if next_hop:
        yield self.out_ports[next_hop].put(txn.advance())
The fan-out branches fire only at the terminal hop. The same component therefore serves both forward command dispatch and reverse response relay roles.
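The transit-versus-terminal distinction can be illustrated with a toy transaction. The sketch below is an assumption about shape only: the real Transaction class is not shown in this ADR, so the `hop` cursor, the `role_at` helper, and the node names are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Transaction:
    """Hypothetical minimal transaction: a pre-computed path plus a cursor."""
    path: tuple[str, ...]
    hop: int = 0  # index of the node currently holding the transaction

    @property
    def next_hop(self) -> Optional[str]:
        # None at the terminal hop, otherwise the next node name on the path.
        nxt = self.hop + 1
        return self.path[nxt] if nxt < len(self.path) else None

    def advance(self) -> "Transaction":
        # Return a copy whose cursor points at the next node on the path.
        return replace(self, hop=self.hop + 1)


def role_at(txn: Transaction) -> str:
    """Hypothetical helper mirroring the `if next_hop` check."""
    return "transit" if txn.next_hop else "terminal"
```

With a response path such as `("pe0", "m_cpu", "io_cpu")`, the transaction is in transit while M_CPU holds it and terminal once it reaches the last node.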
D5. DMA fan-out (_dma_fanout — Memory R/W)
For each Memory R/W request at terminal hop:
- _resolve_dma_destinations(request) returns a per-PE hbm_ctrl.pe{X} derived from the request's PA via ctx.resolver.resolve(PhysAddr.decode(pa)) (ADR-0017 D9).
- For each destination:
  - Acquire the appropriate DMA resource (_dma_write or _dma_read) via with dma_res.request() as req.
  - Resolve path via ctx.router.find_mcpu_dma_path().
  - Compute drain_ns = ctx.compute_drain_ns(path, nbytes).
  - Create sub-Transaction carrying drain_ns and dispatch to path[1].
- Track max_drain_ns across destinations and record it as txn.result_data["xfer_ns"] after all responses arrive.
- After all per-PE responses are collected (D8), send an aggregate ResponseMsg on the reverse command path back to IO_CPU.
PA decode fallback (f"{cube_prefix}.hbm_ctrl") is legacy dead code —
no such node exists after ADR-0017 D4's per-PE partitioning. Kept
defensively but does not route to a real destination.
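The per-destination bookkeeping in D5 (first hop, drain_ns, running maximum) can be sketched without SimPy. Everything here is illustrative: `plan_dma_fanout` is a hypothetical helper, the node names are made up, and `toy_drain_ns` is a placeholder formula, not the behavior of ctx.compute_drain_ns.

```python
from typing import Callable, Sequence

Path = Sequence[str]


def plan_dma_fanout(
    paths: Sequence[Path],
    nbytes: int,
    compute_drain_ns: Callable[[Path, int], float],
) -> tuple[list[dict], float]:
    """Build one sub-transaction record per destination and track the largest drain."""
    subs: list[dict] = []
    max_drain_ns = 0.0
    for path in paths:
        drain_ns = compute_drain_ns(path, nbytes)
        # path[0] is this M_CPU, path[1] is where the sub-transaction is dispatched.
        subs.append({"first_hop": path[1], "dest": path[-1], "drain_ns": drain_ns})
        max_drain_ns = max(max_drain_ns, drain_ns)
    return subs, max_drain_ns


def toy_drain_ns(path: Path, nbytes: int) -> float:
    # Placeholder model (assumption): 64 bytes per ns, scaled by hop count.
    return nbytes / 64.0 * (len(path) - 1)
```

The returned maximum corresponds to the value that D5 records as txn.result_data["xfer_ns"] once all responses have arrived.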
D6. Kernel launch fan-out (_kernel_launch_fanout)
For KernelLaunchMsg at terminal hop:
- _resolve_pe_ids(target_pe) → list of PE ids in this cube.
- For each PE: find path to f"{cube_prefix}.pe{pe_id}.pe_cpu" via ctx.router.find_node_path().
- target_start_ns handling (ADR-0009 D5):
  - If the request already carries target_start_ns (stamped by IO_CPU per ADR-0036 D3): pass through unchanged.
  - If absent (direct-to-M_CPU launch in unit tests): compute a per-cube barrier env.now + max(per-PE leg latency) and stamp via dataclasses.replace.
- Dispatch sub-Transactions with nbytes=0 (kernel launch is a control message; preserving nbytes=0 keeps fan-out off the shared first-hop fabric BW, mirroring ADR-0036 D4).
- After all per-PE responses arrive (D8), aggregate per-PE metrics from each sub-Transaction's result_data into the parent transaction:

      txn.result_data["pe_exec_ns"] = max(existing, max(pe_exec_values))
      txn.result_data["dma_ns"] = max(existing, max(dma_values))
      txn.result_data["compute_ns"] = max(existing, max(compute_values))

  The max-merge with the existing value matters because cross-cube IO_CPU fan-out shares the same parent result_data; merging prevents one cube from clobbering another's metric.
- Send aggregate ResponseMsg on reverse path back to IO_CPU.
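The two pieces of D6 that carry actual logic, the target_start_ns fallback and the metric max-merge, can be sketched as plain functions. This is a minimal sketch under stated assumptions: `KernelLaunchMsg` is reduced to the one field needed here, `stamp_target_start` and `merge_metrics` are hypothetical helper names, and treating a missing parent value as 0.0 is an assumption about what "existing" defaults to.

```python
from dataclasses import dataclass, replace
from typing import Optional, Sequence


@dataclass(frozen=True)
class KernelLaunchMsg:
    """Stand-in carrying only the field this sketch needs."""
    target_start_ns: Optional[float] = None


def stamp_target_start(
    msg: KernelLaunchMsg, now_ns: float, leg_latencies_ns: Sequence[float]
) -> KernelLaunchMsg:
    # Pass through unchanged when a barrier was already stamped upstream.
    if msg.target_start_ns is not None:
        return msg
    # Otherwise compute a per-cube barrier: now + slowest per-PE leg.
    return replace(msg, target_start_ns=now_ns + max(leg_latencies_ns))


METRIC_KEYS = ("pe_exec_ns", "dma_ns", "compute_ns")


def merge_metrics(parent: dict, sub_results: Sequence[dict]) -> None:
    """Max-merge per-PE metrics into the parent without lowering existing values."""
    for key in METRIC_KEYS:
        values = [r[key] for r in sub_results if key in r]
        if values:
            # Assumption: a metric absent from the parent counts as 0.0.
            parent[key] = max(parent.get(key, 0.0), max(values))
```

If another cube has already written pe_exec_ns = 50.0 into the shared parent, merging per-PE values of 30.0 and 40.0 leaves 50.0 in place, which is the clobbering case D6 guards against.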
D7. MMU map/unmap fan-out (_mmu_msg_fanout)
For MmuMapMsg / MmuUnmapMsg at terminal hop:
- _resolve_pe_ids(target_pe) → PE ids.
- For each PE: find path to f"{cube_prefix}.pe{pe_id}.pe_mmu" via find_node_path().
- Dispatch sub-Transactions with nbytes=0.
- PE_MMU is a terminal node: it does not send a ResponseMsg back. Instead, the sub-Transaction's own sub_done event is the completion signal.
- Wait for all sub_done events in-line (does not use the _pending counter; D8 is for response-bearing fan-out only).
- Send aggregate ResponseMsg on reverse path back to IO_CPU.
D8. Response aggregation (_pending + _parent_txns)
For DMA and kernel-launch fan-out (which expect per-PE ResponseMsg arriving on the reverse path):
self._pending: dict[str, tuple[int, int, simpy.Event]] = {}
self._parent_txns: dict[str, Any] = {}
- On dispatch: register (expected, received=0, all_done) and remember the parent transaction.
- _worker recognises responses by is_response=True and routes them to _collect_response, which increments received and signals all_done when received >= expected.
- After yield all_done, the fan-out path constructs the aggregate ResponseMsg:

      resp_msg = ResponseMsg(
          correlation_id=request.correlation_id,
          request_id=request.request_id,
          src_cube=cube_id,
          src_pe=-1,     # -1 = M_CPU aggregate, not a single PE
          success=True,  # no failure semantics implemented
      )

- The response Transaction travels on list(reversed(txn.path)) back to IO_CPU.
MMU fan-out (D7) uses a simpler in-line list of sub_done events
because PE_MMU is terminal — there is no ResponseMsg path to
intercept.
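The counting contract behind _pending can be sketched without SimPy. In this illustration a boolean return value stands in for triggering the all_done event; `PendingTable`, its method names, the string key, and the choice to drop an entry once it completes are all assumptions of the sketch rather than details confirmed by the component.

```python
class PendingTable:
    """Counts per-PE responses per parent request; a flag stands in for the event."""

    def __init__(self) -> None:
        # key -> [expected, received]
        self._pending: dict[str, list[int]] = {}

    def register(self, key: str, expected: int) -> None:
        # Called at dispatch time, before any response can arrive.
        self._pending[key] = [expected, 0]

    def collect(self, key: str) -> bool:
        """Record one response; return True when the fan-out is complete."""
        entry = self._pending[key]
        entry[1] += 1
        done = entry[1] >= entry[0]
        if done:
            # Sketch-only choice: forget completed fan-outs.
            del self._pending[key]
        return done

    def outstanding(self) -> list[str]:
        # Keys still waiting on at least one response.
        return list(self._pending)
```

A fan-out whose last response never arrives stays in `outstanding()` forever, which is the stall described under the negative consequences.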
D9. Helpers and configurable attribute
_resolve_pe_ids(target_pe):
- int → [target_pe]
- tuple[int, ...] → list(target_pe)
- "all" → range(n_slices), where n_slices comes from cube memory_map.hbm_slices_per_cube (default 8).
Used by kernel-launch and MMU fan-out paths.
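A standalone sketch of the mapping above follows. It takes n_slices as a parameter instead of reading it from the cube's memory map, returns a list in the "all" case for uniformity, and raises on any other input; the last two points are assumptions of this sketch, not documented behavior.

```python
from typing import Union


def resolve_pe_ids(
    target_pe: Union[int, tuple[int, ...], str], n_slices: int = 8
) -> list[int]:
    """Expand a target_pe selector into a concrete list of PE ids."""
    if isinstance(target_pe, int):
        return [target_pe]
    if isinstance(target_pe, tuple):
        return list(target_pe)
    if target_pe == "all":
        return list(range(n_slices))
    # Assumption: unknown selectors are rejected rather than ignored.
    raise ValueError(f"unsupported target_pe: {target_pe!r}")
```

For example, `resolve_pe_ids("all", n_slices=4)` yields PE ids 0 through 3, while `resolve_pe_ids((1, 2))` yields exactly those two.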
Single configurable attribute drives per-instance latency:
| Site | impl name | overhead_ns |
|---|---|---|
| Cube m_cpu | builtin.m_cpu | 5.0 |
Applied once in run() per Transaction — models command
interpretation and dispatch-decision time at M_CPU.
Consequences
Positive
- Three fan-out paths are clearly separated by request type — adding a new request kind is an isinstance branch + one fan-out method.
- M_CPU.DMA channels are independent (read and write run concurrently) and serialize only the dispatch step at capacity=1.
- Transit-vs-terminal behavior is a single if next_hop check, so the same component handles forward dispatch and reverse response relay without role duplication.
- target_start_ns passthrough (D6) preserves the cross-cube barrier established by IO_CPU (ADR-0036 D3), while the fallback computation keeps direct-to-M_CPU unit tests working.
- Per-PE metric max-merge against existing parent result_data values is robust to cross-cube IO_CPU fan-out sharing the same parent.
Negative
- No partial-failure semantics: a missing per-PE response stalls the parent all_done indefinitely. Acceptable for simulation; not suitable as a production-style endpoint.
- _resolve_dma_destinations's cube-wide hbm_ctrl fallback is dead code (no such node exists post-ADR-0017 D4). Kept defensively; invites confusion and merits a follow-up cleanup.
- DMA resource serialization applies only at dispatch (the put call is instantaneous in unbounded stores). The capacity=1 channel models "one request in flight at a time at this M_CPU", not "transfer duration serialization"; readers must consult wire processes (ADR-0015 D2) and drain_ns for actual transfer parallelism.
Links
- ADR-0009 D3 (M_CPU fan-out and aggregation completion semantics)
- ADR-0009 D5 (target_start_ns: passed through unchanged when present; computed as per-cube barrier when absent)
- ADR-0011 D-VA3 (MmuMapMsg fabric path includes M_CPU as PE fan-out point)
- ADR-0014 D4 (DMA engine capacity=1; M_CPU.DMA mirrors the same contract at cube level)
- ADR-0015 D5 (M_CPU.DMA is internal subcomponent of M_CPU, not a topology node)
- ADR-0017 D9 (AddressResolver returns per-PE hbm_ctrl.pe{X})
- ADR-0036 D3 / D4 (IO_CPU stamps target_start_ns; M_CPU passes through unchanged; nbytes=0 invariant preserved through fan-out)