# ADR-0035: M_CPU and M_CPU.DMA Component Model

## Status

Accepted

## Context

M_CPU is the cube-level command processor. It receives commands from
IO_CPU (or from PCIE_EP when the engine routes Memory R/W through
M_CPU as a fallback), fans them out to the PEs in its cube, and
aggregates per-PE responses into a single ResponseMsg sent back to
IO_CPU on the reverse path.

M_CPU.DMA is the cube-level DMA channel pair that handles Memory R/W
fan-out. Per ADR-0015 D5 it is **not** a separate topology node;
it lives as internal state of `MCpuComponent`.

This ADR documents the M_CPU component implementation that realizes
those responsibilities, including the three distinct fan-out paths
(Memory R/W, Kernel Launch, MMU Map/Unmap), the M_CPU.DMA resource
model, and the response aggregation contract.

## Decision

### D1. Role

M_CPU has three responsibilities:

1. **Transit forwarding**: when not the terminal hop (e.g., on the
   reverse response path PE → M_CPU → IO_CPU), it forwards Transactions
   to `next_hop` in their pre-computed path.
2. **Multi-PE fan-out at terminal hop**: dispatches to one of three
   fan-out paths based on request type (D2).
3. **Response aggregation**: collects per-PE responses and sends a
   single aggregate ResponseMsg back to IO_CPU on the reverse path.

Each `run()` invocation applies `overhead_ns` once per incoming
Transaction.

M_CPU does **not**:

- Decide routing: paths are pre-computed by the router (ADR-0002).
- Handle PE-internal execution: that belongs to PE_CPU / PE_SCHEDULER /
  engines (ADR-0014).
- Decode addresses: `ctx.resolver.resolve(pa)` returns the per-PE
  `hbm_ctrl.pe{X}` directly (ADR-0017 D9).
- Interpret tensor or kernel semantics: fan-out is dispatched by a
  Python `isinstance` check only.

### D2. Three fan-out paths dispatched by request type

At the terminal hop the worker dispatches by request type:

```python
elif self.ctx is not None and txn.request is not None:
    if isinstance(txn.request, KernelLaunchMsg):
        env.process(self._kernel_launch_fanout(env, txn))
    elif isinstance(txn.request, (MmuMapMsg, MmuUnmapMsg)):
        env.process(self._mmu_msg_fanout(env, txn))
    else:
        env.process(self._dma_fanout(env, txn))
```

Each path resolves its route through the router. The DMA path has a
dedicated method; the two command paths share the generic one:

- `_dma_fanout` uses `ctx.router.find_mcpu_dma_path()`, the
  M_CPU-specific DMA path that avoids PE pipeline nodes.
- `_kernel_launch_fanout` uses `ctx.router.find_node_path()`, the
  generic NOC command path to PE_CPU.
- `_mmu_msg_fanout` uses `ctx.router.find_node_path()`, the NOC command
  path to PE_MMU.

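The dispatch rule above can be exercised in isolation. The following is a minimal sketch, assuming empty stand-in message classes and a hypothetical `select_fanout` helper that returns the method name instead of spawning a SimPy process; the `ROUTER_METHOD` table is only a summary of the list above, not part of the component.

```python
from dataclasses import dataclass


# Stand-in message types; the real classes carry more fields.
@dataclass
class KernelLaunchMsg:
    target_pe: object = "all"


@dataclass
class MmuMapMsg:
    target_pe: object = "all"


@dataclass
class MmuUnmapMsg:
    target_pe: object = "all"


@dataclass
class MemoryWriteMsg:
    pa: int = 0


@dataclass
class MemoryReadMsg:
    pa: int = 0


def select_fanout(request) -> str:
    """Return the name of the fan-out path for a terminal-hop request."""
    if isinstance(request, KernelLaunchMsg):
        return "_kernel_launch_fanout"
    if isinstance(request, (MmuMapMsg, MmuUnmapMsg)):
        return "_mmu_msg_fanout"
    # Everything else (Memory R/W) falls through to the DMA path.
    return "_dma_fanout"


# Hypothetical lookup summarizing which router method each path uses.
ROUTER_METHOD = {
    "_dma_fanout": "find_mcpu_dma_path",
    "_kernel_launch_fanout": "find_node_path",
    "_mmu_msg_fanout": "find_node_path",
}
```

Because the final branch is a bare `else`, any request type that is not matched earlier lands on the DMA path, so a new message kind needs its own `isinstance` branch ahead of it.
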
### D3. M_CPU.DMA internal subcomponent (ADR-0015 D5)

`MCpuComponent.start()` initializes two SimPy resources:

```python
self._dma_write = simpy.Resource(env, capacity=1)  # MemoryWriteMsg
self._dma_read = simpy.Resource(env, capacity=1)   # MemoryReadMsg
```

Properties:

- **Not a topology node**: managed entirely inside `MCpuComponent`;
  does not appear in `topology.yaml` or in the compiled graph.
- **Independent read and write channels**: concurrent in-flight
  Memory R/W is allowed.
- **Capacity=1 per channel** serializes the **dispatch step**
  (`yield self.out_ports[...].put(...)`) of concurrent in-flight Memory
  R/W requests at this M_CPU. Actual fabric transfer time is modeled
  by wire processes between components (ADR-0015 D2) and by
  `drain_ns` at terminal hops; the DMA resource does not gate
  transfer duration.

Resource selection is request-type-based:

```python
dma_res = self._dma_write if isinstance(request, MemoryWriteMsg) else self._dma_read
```

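The channel behavior can be illustrated without the simulator. The sketch below is an analogy, not the component's code: it swaps `simpy.Resource(capacity=1)` for `asyncio.Lock`, uses stand-in message classes, and replaces the dispatch step with a short sleep. It shows two writes serializing on one channel while a read proceeds on the other.

```python
import asyncio


class MemoryWriteMsg:
    """Stand-in for the real write message."""


class MemoryReadMsg:
    """Stand-in for the real read message."""


async def dispatch(name, request, channels, log):
    # Same selection rule as D3: writes use one channel, everything else the other.
    channel = channels["write"] if isinstance(request, MemoryWriteMsg) else channels["read"]
    async with channel:  # capacity=1: one holder at a time
        log.append(f"{name}:start")
        await asyncio.sleep(0.01)  # stand-in for the dispatch step
        log.append(f"{name}:end")


async def main():
    # Locks are created inside the running loop; each models one capacity=1 channel.
    channels = {"write": asyncio.Lock(), "read": asyncio.Lock()}
    log = []
    await asyncio.gather(
        dispatch("w1", MemoryWriteMsg(), channels, log),
        dispatch("w2", MemoryWriteMsg(), channels, log),
        dispatch("r1", MemoryReadMsg(), channels, log),
    )
    return log


log = asyncio.run(main())
```

In the resulting log the two write intervals never overlap, while the read starts before the first write has finished.
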
### D4. Transit forwarding at non-terminal hops

When `txn.next_hop` is not None (typical for the reverse response
path PE → M_CPU → IO_CPU), the worker forwards normally:

```python
if next_hop:
    yield self.out_ports[next_hop].put(txn.advance())
```

The fan-out branches fire only at the terminal hop. The same component
therefore serves both forward command dispatch and reverse response
relay roles.

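The transit-versus-terminal decision can be sketched with a small path cursor. This is an illustration under assumptions: `Txn`, its `hop` index, and the node names are hypothetical stand-ins for the simulator's Transaction, whose real fields are not shown in this ADR.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Txn:
    """Hypothetical stand-in for the simulator's Transaction."""
    path: Tuple[str, ...]
    hop: int = 0  # index of the node currently holding the transaction

    @property
    def next_hop(self) -> Optional[str]:
        nxt = self.hop + 1
        return self.path[nxt] if nxt < len(self.path) else None

    def advance(self) -> "Txn":
        # Return a copy positioned at the next node on the path.
        return replace(self, hop=self.hop + 1)


def handle(txn: Txn) -> str:
    """Mirror the worker's check: forward if a next hop exists, else fan out."""
    if txn.next_hop:
        return f"forward to {txn.next_hop}"
    return "terminal: fan out"
```

A response sitting at M_CPU in the middle of `("cube0.pe0.pe_cpu", "cube0.m_cpu", "io_cpu")` is forwarded, while a command whose path ends at M_CPU takes the fan-out branch.
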
### D5. DMA fan-out (`_dma_fanout`: Memory R/W)

For each Memory R/W request at the terminal hop:

1. `_resolve_dma_destinations(request)` returns a per-PE
   `hbm_ctrl.pe{X}` derived from the request's PA via
   `ctx.resolver.resolve(PhysAddr.decode(pa))` (ADR-0017 D9).
2. For each destination:
   - Acquire the appropriate DMA resource (`_dma_write` or
     `_dma_read`) via `with dma_res.request() as req`.
   - Resolve the path via `ctx.router.find_mcpu_dma_path()`.
   - Compute `drain_ns = ctx.compute_drain_ns(path, nbytes)`.
   - Create a sub-Transaction carrying `drain_ns` and dispatch it to
     `path[1]`.
3. Track `max_drain_ns` across destinations and record it as
   `txn.result_data["xfer_ns"]` after all responses arrive.
4. After all per-PE responses are collected (D8), send an aggregate
   ResponseMsg on the reverse command path back to IO_CPU.

The PA-decode fallback (`f"{cube_prefix}.hbm_ctrl"`) is legacy dead code:
no such node exists after ADR-0017 D4's per-PE partitioning. It is kept
defensively but does not route to a real destination.

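Steps 2 and 3 above reduce to a loop that plans one sub-dispatch per destination and keeps the largest `drain_ns`. The sketch below assumes a hypothetical `fan_out_drain` helper, a toy path table, and an invented cost model (`toy_drain_ns`); the real `ctx.compute_drain_ns` model is not described in this ADR, and the sketch also assumes `path[0]` is the M_CPU itself.

```python
from typing import Callable, Dict, List, Tuple


def fan_out_drain(
    destinations: List[str],
    nbytes: int,
    find_path: Callable[[str], List[str]],
    compute_drain_ns: Callable[[List[str], int], float],
) -> Tuple[List[Dict[str, object]], float]:
    """Plan one sub-dispatch per destination and return the largest drain_ns."""
    max_drain_ns = 0.0
    planned = []
    for dst in destinations:
        path = find_path(dst)
        drain_ns = compute_drain_ns(path, nbytes)
        # Assumed: path[0] is this M_CPU, so the sub-Transaction goes to path[1].
        planned.append({"dst": dst, "first_hop": path[1], "drain_ns": drain_ns})
        max_drain_ns = max(max_drain_ns, drain_ns)
    return planned, max_drain_ns


# Toy topology and cost model, for illustration only.
PATHS = {
    "hbm_ctrl.pe0": ["m_cpu", "noc", "hbm_ctrl.pe0"],
    "hbm_ctrl.pe1": ["m_cpu", "noc", "noc_far", "hbm_ctrl.pe1"],
}


def toy_drain_ns(path: List[str], nbytes: int) -> float:
    # Hypothetical: 64 bytes/ns of bandwidth plus 2 ns per hop.
    return nbytes / 64.0 + 2.0 * (len(path) - 1)


planned, xfer_ns = fan_out_drain(list(PATHS), 128, PATHS.__getitem__, toy_drain_ns)
```

With this toy model the two destinations drain in 6.0 ns and 8.0 ns, so the value that would be recorded as `xfer_ns` is 8.0.
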
### D6. Kernel launch fan-out (`_kernel_launch_fanout`)

For `KernelLaunchMsg` at the terminal hop:

1. `_resolve_pe_ids(target_pe)` → list of PE ids in this cube.
2. For each PE: find the path to `f"{cube_prefix}.pe{pe_id}.pe_cpu"` via
   `ctx.router.find_node_path()`.
3. **`target_start_ns` handling** (ADR-0009 D5):
   - If the request already carries `target_start_ns` (stamped by
     IO_CPU per ADR-0036 D3): **pass through unchanged**.
   - If absent (direct-to-M_CPU launch in unit tests): compute a
     per-cube barrier `env.now + max(per-PE leg latency)` and stamp
     it via `dataclasses.replace`.
4. Dispatch sub-Transactions with `nbytes=0` (kernel launch is a
   control message; preserving `nbytes=0` keeps fan-out off the shared
   first-hop fabric BW, mirroring ADR-0036 D4).
5. After all per-PE responses arrive (D8), aggregate per-PE metrics
   from each sub-Transaction's `result_data` into the parent
   transaction:

   ```python
   # Schematic: `existing` is the value already stored under the same key.
   txn.result_data["pe_exec_ns"] = max(existing, max(pe_exec_values))
   txn.result_data["dma_ns"] = max(existing, max(dma_values))
   txn.result_data["compute_ns"] = max(existing, max(compute_values))
   ```

   The max-merge with the existing value matters because cross-cube
   IO_CPU fan-out shares the same parent `result_data`; merging
   prevents one cube from clobbering another's metric.
6. Send an aggregate ResponseMsg on the reverse path back to IO_CPU.

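The `target_start_ns` rule in step 3 can be sketched as a pure function. This is a minimal sketch under assumptions: the stand-in `KernelLaunchMsg` models only one field, "absent" is represented as `None`, and `stamp_target_start` with its `leg_latency_ns` mapping is a hypothetical helper, not the component's method.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class KernelLaunchMsg:
    """Stand-in; only the field relevant to step 3 is modeled."""
    target_start_ns: Optional[float] = None


def stamp_target_start(
    request: KernelLaunchMsg,
    now_ns: float,
    leg_latency_ns: Dict[int, float],
) -> KernelLaunchMsg:
    # Already stamped upstream (by IO_CPU): pass through unchanged.
    if request.target_start_ns is not None:
        return request
    # Otherwise use a per-cube barrier: start when the slowest leg has arrived.
    barrier = now_ns + max(leg_latency_ns.values())
    return replace(request, target_start_ns=barrier)
```

With `now_ns=100.0` and leg latencies of 12 ns and 20 ns, an unstamped request receives `target_start_ns=120.0`, while a request that already carries a value is returned as the same object.
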
### D7. MMU map/unmap fan-out (`_mmu_msg_fanout`)

For `MmuMapMsg` / `MmuUnmapMsg` at the terminal hop:

1. `_resolve_pe_ids(target_pe)` → PE ids.
2. For each PE: find the path to `f"{cube_prefix}.pe{pe_id}.pe_mmu"` via
   `find_node_path()`.
3. Dispatch sub-Transactions with `nbytes=0`.
4. PE_MMU is a terminal node: it does **not** send a ResponseMsg
   back. Instead, the sub-Transaction's own `sub_done` event is the
   completion signal.
5. Wait for all `sub_done` events in-line (this does **not** use the
   `_pending` counter; D8 is for response-bearing fan-out only).
6. Send an aggregate ResponseMsg on the reverse path back to IO_CPU.

### D8. Response aggregation (`_pending` + `_parent_txns`)

For DMA and kernel-launch fan-out, which expect a per-PE ResponseMsg
on the reverse path, M_CPU keeps two tables:

```python
self._pending: dict[str, tuple[int, int, simpy.Event]] = {}
self._parent_txns: dict[str, Any] = {}
```

- On dispatch: register `(expected, received=0, all_done)` and
  remember the parent transaction.
- `_worker` recognises responses by `is_response=True` and routes
  them to `_collect_response`, which increments `received` and
  signals `all_done` when `received >= expected`.
- After `yield all_done`, the fan-out path constructs the aggregate
  ResponseMsg:

  ```python
  resp_msg = ResponseMsg(
      correlation_id=request.correlation_id,
      request_id=request.request_id,
      src_cube=cube_id,
      src_pe=-1,     # -1 = M_CPU aggregate, not a single PE
      success=True,  # no failure semantics implemented
  )
  ```

- The response Transaction travels on `list(reversed(txn.path))`
  back to IO_CPU.

MMU fan-out (D7) uses a simpler in-line list of `sub_done` events
because PE_MMU is terminal, so there is no ResponseMsg path to
intercept.

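The counting contract above can be sketched without SimPy. In this illustration a plain callback stands in for the `all_done` event, the string key is a hypothetical identifier for the parent request, and deleting the entry on completion is an assumption rather than documented behavior.

```python
from typing import Callable, Dict, List, Tuple


class ResponseAggregator:
    """Counts per-PE responses per parent request and fires a callback when complete."""

    def __init__(self) -> None:
        # key -> (expected, received, on_all_done); the real table stores a simpy.Event.
        self._pending: Dict[str, Tuple[int, int, Callable[[], None]]] = {}

    def register(self, key: str, expected: int, on_all_done: Callable[[], None]) -> None:
        self._pending[key] = (expected, 0, on_all_done)

    def collect_response(self, key: str) -> None:
        expected, received, on_all_done = self._pending[key]
        received += 1
        if received >= expected:
            # Assumed cleanup: drop the entry once the fan-out is complete.
            del self._pending[key]
            on_all_done()
        else:
            self._pending[key] = (expected, received, on_all_done)


def reverse_path(path: List[str]) -> List[str]:
    # The aggregate response retraces the command path in reverse.
    return list(reversed(path))
```

Registering a fan-out of three and collecting only two responses leaves the callback unfired, which is the stall described under Negative consequences; the third response completes it.
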
### D9. Helpers and configurable attribute

`_resolve_pe_ids(target_pe)`:

- `int` → `[target_pe]`
- `tuple[int, ...]` → `list(target_pe)`
- `"all"` → `range(n_slices)` where `n_slices` comes from cube
  `memory_map.hbm_slices_per_cube` (default 8).

Used by kernel-launch and MMU fan-out paths.

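The mapping above is small enough to write out. The following is a sketch, assuming a free function with `n_slices` passed as a parameter (the real helper reads it from the cube's memory map) and a `TypeError` for unsupported inputs, which this ADR does not specify.

```python
from typing import List, Tuple, Union

TargetPe = Union[int, Tuple[int, ...], str]


def resolve_pe_ids(target_pe: TargetPe, n_slices: int = 8) -> List[int]:
    """Expand a target_pe selector into a list of PE ids."""
    if target_pe == "all":
        return list(range(n_slices))
    if isinstance(target_pe, int):
        return [target_pe]
    if isinstance(target_pe, tuple):
        return list(target_pe)
    # Assumption: anything else is a caller error.
    raise TypeError(f"unsupported target_pe: {target_pe!r}")
```

For example, `resolve_pe_ids((0, 2, 5))` yields `[0, 2, 5]`, and `resolve_pe_ids("all", n_slices=2)` yields `[0, 1]`.
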
A single configurable attribute drives per-instance latency:

| Site | impl name | overhead_ns |
| --- | --- | --- |
| Cube `m_cpu` | `builtin.m_cpu` | 5.0 |

It is applied once in `run()` per Transaction and models command
interpretation and dispatch-decision time at M_CPU.

## Consequences

### Positive

- Three fan-out paths are clearly separated by request type, so adding
  a new request kind is an isinstance branch + one fan-out method.
- M_CPU.DMA channels are independent (read and write run concurrently)
  and serialize only the dispatch step at capacity=1.
- Transit-vs-terminal behavior is a single `if next_hop` check, so
  the same component handles forward dispatch and reverse response
  relay without role duplication.
- `target_start_ns` passthrough (D6) preserves the cross-cube barrier
  established by IO_CPU (ADR-0036 D3), while the fallback computation
  keeps direct-to-M_CPU unit tests working.
- Per-PE metric `max`-merge against existing parent `result_data`
  values is robust to cross-cube IO_CPU fan-out sharing the same
  parent.

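The max-merge in the last bullet (D6 step 5) can be shown with two cubes writing into one shared dictionary. This is a minimal sketch: `merge_pe_metrics` is a hypothetical helper, and treating a missing existing value as `0.0` is an assumption.

```python
from typing import Dict, Iterable

METRIC_KEYS = ("pe_exec_ns", "dma_ns", "compute_ns")


def merge_pe_metrics(parent: Dict[str, float], sub_results: Iterable[Dict[str, float]]) -> None:
    """Max-merge per-PE metrics into the parent result_data, in place."""
    subs = list(sub_results)
    for key in METRIC_KEYS:
        values = [r[key] for r in subs if key in r]
        if not values:
            continue
        # Assumed default of 0.0 when the parent has no value yet.
        parent[key] = max(parent.get(key, 0.0), max(values))


# First cube reports two PEs; second cube reports one PE into the same parent.
parent: Dict[str, float] = {}
merge_pe_metrics(parent, [
    {"pe_exec_ns": 40.0, "dma_ns": 10.0, "compute_ns": 30.0},
    {"pe_exec_ns": 55.0, "dma_ns": 5.0, "compute_ns": 50.0},
])
after_first = dict(parent)
merge_pe_metrics(parent, [{"pe_exec_ns": 20.0, "dma_ns": 25.0, "compute_ns": 10.0}])
```

After the second merge, `pe_exec_ns` and `compute_ns` keep the first cube's larger values and only `dma_ns` is raised; a plain assignment would have overwritten all three.
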
### Negative

- No partial-failure semantics: a missing per-PE response stalls the
  parent `all_done` indefinitely. Acceptable for simulation; not
  suitable as a production-style endpoint.
- The cube-wide `hbm_ctrl` fallback in `_resolve_dma_destinations` is
  dead code (no such node exists post-ADR-0017 D4). It is kept
  defensively, but it invites confusion and merits a follow-up cleanup.
- DMA resource serialization applies only at dispatch (the `put` call
  is instantaneous in unbounded stores). The capacity=1 channel
  models "one request in flight at a time at this M_CPU", not
  "transfer duration serialization"; readers must consult wire
  processes (ADR-0015 D2) and `drain_ns` for actual transfer
  parallelism.

## Links

- ADR-0009 D3 (M_CPU fan-out and aggregation completion semantics)
- ADR-0009 D5 (`target_start_ns`: passed through unchanged when
  present; computed as per-cube barrier when absent)
- ADR-0011 D-VA3 (MmuMapMsg fabric path includes M_CPU as PE fan-out
  point)
- ADR-0014 D4 (DMA engine capacity=1; M_CPU.DMA mirrors the same
  contract at cube level)
- ADR-0015 D5 (M_CPU.DMA is internal subcomponent of M_CPU, not a
  topology node)
- ADR-0017 D9 (AddressResolver returns per-PE `hbm_ctrl.pe{X}`)
- ADR-0036 D3 / D4 (IO_CPU stamps `target_start_ns`; M_CPU passes
  through unchanged; nbytes=0 invariant preserved through fan-out)