687c98086d
Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
(dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
retroactive docs pending verification.
Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
deleted; ADR-0019/0021 moved to adr-history with one-line stub status
Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
selection, flit-aware per-flit commit, async finalize, command-only
fallback path)
Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
"Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
(now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)
Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py
Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.
Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
(ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
522 lines
18 KiB
Markdown
522 lines
18 KiB
Markdown
# ADR-0011: Memory Addressing — PA / VA / LA Address Models
|
||
|
||
## Status
|
||
|
||
Accepted.
|
||
|
||
- **VA model: currently implemented (default).**
|
||
- PA model: implemented as PageFault fallback in PE_DMA.
|
||
- LA model: proposed, not implemented.
|
||
|
||
## Context
|
||
|
||
KernBench's address model evolved through three design points, each
|
||
addressing a limitation of the previous. This ADR documents all three
|
||
in one place because future implementation work selects among them.
|
||
|
||
### PA-only baseline
|
||
|
||
Phase 0 of KernBench treated all device memory operations
|
||
(MemoryRead/MemoryWrite) as raw physical-address transfers. No
|
||
host-side virtual addressing, no MMU/IOMMU translation. Allocators
|
||
returned PA mappings; DMA requests carried PA directly.
|
||
|
||
This was sufficient for early correctness/latency work but
|
||
insufficient for running standard Triton kernels that use
|
||
`base_addr + offset` patterns on sharded tensors: each PE's shard
|
||
has a different PA, but the kernel needs a single contiguous address
|
||
space to compute offsets.
|
||
|
||
### Why VA/MMU (current default)
|
||
|
||
A realistic system uses host-side virtual addressing and an
|
||
MMU/IOMMU-style translation path for DMA: the host allocates physical
|
||
memory at PE level, maps it into a virtual address space, installs
|
||
mappings, and DMA requests use virtual addresses that are translated
|
||
to physical addresses.
|
||
|
||
Adopting this model lets kernels use `base_addr + offset` over a
|
||
contiguous VA range while the device-side MMU translates each access
|
||
to the appropriate PA.
|
||
|
||
### Why LA/BAAW (proposed)
|
||
|
||
VA/MMU treats HBM as a single backing space. KernBench needs to
|
||
explore architectures where HBM is composed of multiple pseudo
|
||
channels in parallel:
|
||
|
||
- CUBE's HBM has 32 or 64 pseudo channels.
|
||
- In a PE-Local-HBM model, each PE is assigned N pseudo channels
|
||
(N = `hbm_pseudo_channels / pes_per_cube`).
|
||
- Per-channel BW (e.g. 32 GB/s) determines aggregate PE BW
|
||
(N × per-channel).
|
||
|
||
Two channel-mapping modes need to be modelable:
|
||
|
||
- **1:1 mode** — one logical access → N per-channel requests.
|
||
Precise per-channel BW contention modelling.
|
||
- **n:1 mode (default)** — one logical access → one aggregated
|
||
request. Channels are assumed to interleave; aggregated BW model.
|
||
|
||
VA's `tl.load(va_ptr)` produces a single DMA request to a single
|
||
target. Decomposing that into per-channel requests inside PE_DMA
|
||
requires the address layer to be aware of channels. This is the
|
||
role of the LA (Logical Address) abstraction with BAAW
|
||
(Logical-to-Physical Mapping Unit).
|
||
|
||
Core requirements driving the LA design:
|
||
|
||
- PE_DMA → HBM_CTRL effective bandwidth semantics must be identical
|
||
in both modes (only request shape and resource model differ).
|
||
- Kernel programming model is unchanged — physical channel
|
||
information is never exposed to kernel code.
|
||
- Mode switch is a topology-level configuration.
|
||
|
||
### Design space summary
|
||
|
||
| Model | Status | Key idea |
|
||
|-------|--------|----------|
|
||
| PA | fallback (implemented) | Direct physical addressing, no translation |
|
||
| VA | current default (implemented) | Per-tensor contiguous VA range; MMU translates per access |
|
||
| LA | proposed | LA + BAAW resolves to (PA, channel); supports 1:1 and n:1 channel mapping modes |
|
||
|
||
---
|
||
|
||
## Decision
|
||
|
||
This ADR defines three address models. At any given time the system
|
||
operates in exactly one model. Selection is topology- / configuration-
|
||
driven; coexistence within one simulation run is not required.
|
||
|
||
---
|
||
|
||
### Address Model: PA (Physical Address) — fallback
|
||
|
||
#### D-PA1. PA-only semantics
|
||
|
||
- All device memory accesses (MemoryRead/MemoryWrite) operate on
|
||
device physical addresses (PA) plus size.
|
||
- PA-only mode remains functional via the PageFault fallback path in
|
||
PE_DMA: if a DMA src/dst address has no MMU mapping, PE_DMA treats
|
||
the value as a PA directly.
|
||
|
||
#### D-PA2. Allocation produces PA mappings
|
||
|
||
Device allocation selects PE-local memory regions and returns PA
|
||
mappings sufficient to execute kernels and issue DMA requests.
|
||
|
||
PA model is retained primarily for backward compatibility with PA-only
|
||
tests and as the underlying physical layer that VA / LA models resolve
|
||
into.
|
||
|
||
---
|
||
|
||
### Address Model: VA (Virtual Address with MMU) — current default
|
||
|
||
#### D-VA1. Virtual Address Model
|
||
|
||
- Each tensor gets a single contiguous VA range (`TensorHandle.va_base`).
|
||
- `TensorShard` does NOT carry a `va` field — shard VA is derived as
|
||
`va_base + offset_bytes`.
|
||
- Kernels receive `va_base` as their pointer argument (via
|
||
`TensorArg.va_base`).
|
||
- `DmaReadCmd.src_addr` and `DmaWriteCmd.dst_addr` carry VA (not PA).
|
||
|
||
#### D-VA2. PE_MMU Component
|
||
|
||
- Hybrid design: SimPy component (inbox for `MmuMapMsg`) + utility
|
||
(synchronous `translate()` called by PE_DMA).
|
||
- Page-aligned dict lookup for O(1) VA → PA translation.
|
||
- `tlb_overhead_ns` configurable per-access latency.
|
||
- PageFault fallback: if VA has no mapping, PE_DMA treats it as PA
|
||
directly (preserves PA model for backward compatibility).
|
||
|
||
#### D-VA3. Mapping Installation
|
||
|
||
- `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube
|
||
fan-out) → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured
|
||
end-to-end.
|
||
- `MmuMapMsg.target_sips` controls SIP-level routing to prevent
|
||
cross-SIP mapping contamination for replicated tensors.
|
||
- Mapping strategy based on `DPPolicy.cube`:
|
||
- **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping
|
||
only. Each cube's PEs see only their local PA. No cross-cube
|
||
mapping installed.
|
||
- **Sharded** (`cube="column_wise"`, etc.): broadcast all shard
|
||
mappings to all target cubes. Enables cross-PE and cross-cube
|
||
DMA.
|
||
|
||
#### D-VA4. Tensor Lifecycle
|
||
|
||
- `del tensor` triggers automatic cleanup via `Tensor.__del__` +
|
||
`weakref` to `RuntimeContext`. Sends `MmuUnmapMsg` through fabric,
|
||
returns VA and PA space.
|
||
- `with RuntimeContext(...) as ctx:` provides scope-based bulk cleanup.
|
||
- `RuntimeContext._tensors` uses `weakref.ref` to avoid preventing GC.
|
||
- `PEMemAllocator` uses free-list with coalescing (not bump allocator).
|
||
- `VirtualAllocator` uses free-list with coalescing for VA space.
|
||
|
||
#### D-VA5. Allocators
|
||
|
||
- `VirtualAllocator`: device-wide VA space, page-aligned alloc/free
|
||
with coalescing.
|
||
- `PEMemAllocator`: per-PE HBM/TCM, free-list based alloc/free with
|
||
coalescing.
|
||
- Page size configurable via `topology.yaml` `pe_mmu` attrs
|
||
(default 4096).
|
||
|
||
#### Consequences (VA model)
|
||
|
||
- Triton kernels use `base_addr + offset` patterns naturally on
|
||
sharded tensors.
|
||
- All latency remains explicit via graph traversal, including MMU
|
||
mapping installation and per-access TLB overhead.
|
||
- PA-only mode retained as fallback (PageFault → treat as PA).
|
||
- IPCQ and other fixed-address resources bypass MMU (use PA directly).
|
||
|
||
---
|
||
|
||
### Address Model: LA (Logical Address with BAAW) — proposed
|
||
|
||
LA replaces VA when channel-level HBM modelling is required.
|
||
Adopting this model removes the VA/MMU infrastructure (D-LA1 lists the
|
||
removed artifacts). Coexistence with VA in the same run is not a goal.
|
||
|
||
#### D-LA1. LA introduction — replaces VA infrastructure
|
||
|
||
LA is the sole address space used by kernel code (`tl.load`,
|
||
`tl.store`, `tl.composite`). Properties:
|
||
|
||
- Can map a Tensor to a contiguous logical space (like VA).
|
||
- Expresses `(logical buffer + offset)`.
|
||
- Does NOT contain physical channel information directly.
|
||
- Stays as an intermediate abstraction until physical resolution.
|
||
|
||
LA address space:
|
||
|
||
| Item | Value |
|
||
|------|-------|
|
||
| LA start | `0x1_0000_0000` (4 GB, preserves former VA start) |
|
||
| LA space size | 64 GB per PE |
|
||
| Alignment unit | segment (see D-LA3) |
|
||
|
||
LA is PE-local: different PEs may use the same LA value; BAAW segment
|
||
tables differ → they resolve to different PAs.
|
||
|
||
VA infrastructure removed when LA is adopted:
|
||
|
||
| Removed | Replacement |
|
||
|---------|-------------|
|
||
| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, renamed) |
|
||
| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
|
||
| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component |
|
||
| `runtime_api/kernel.py`: `MmuMapMsg`, `MmuUnmapMsg` | `BaawSegmentInstallMsg` |
|
||
| `runtime_api/context.py`: VA alloc + MMU install | LA alloc + BAAW segment install |
|
||
| `runtime_api/tensor.py`: `va_base` | `la_base` |
|
||
| `topology.yaml`: `pe_mmu` component entry | Removed |
|
||
|
||
#### D-LA2. Mapping mode setting
|
||
|
||
Topology-level (cube) configuration:
|
||
|
||
```yaml
|
||
cube:
|
||
memory_map:
|
||
hbm_mapping_mode: n_to_one # one_to_one | n_to_one
|
||
hbm_pseudo_channels: 64 # total pseudo channel count
|
||
hbm_channels_per_pe: 8 # per-PE local channel count
|
||
hbm_channel_bw_gbs: 32.0 # per-channel bandwidth
|
||
```
|
||
|
||
Consumed by the graph compiler (topology builder) and BAAW
|
||
initialisation.
|
||
|
||
#### D-LA3. Segment and BAAW
|
||
|
||
Segment partitions the LA space; each segment maps to a specific HBM
|
||
channel or channel group. Created at tensor deploy time by the runtime
|
||
allocator. BAAW resolves LA → physical request(s) using the segment
|
||
table.
|
||
|
||
```python
|
||
@dataclass
|
||
class BaawSegment:
|
||
la_base: int # segment start LA
|
||
la_size: int # segment size (bytes)
|
||
mode: str # "one_to_one" | "n_to_one"
|
||
# 1:1 mode fields
|
||
channel_count: int # channels assigned to this segment (e.g. 8)
|
||
pa_bases: list[int] # per-channel PA bases (len = channel_count)
|
||
channel_ids: list[int] # per-channel logical IDs (e.g. [0..7])
|
||
channel_size: int # per-channel size (la_size // channel_count)
|
||
# n:1 mode fields
|
||
agg_pa_base: int # aggregated PA base
|
||
agg_node_id: str # aggregated router node_id
|
||
```
|
||
|
||
Segment lifecycle:
|
||
|
||
1. **Allocate** (tensor deploy): RuntimeContext allocates LA from LA
|
||
allocator. PEMemAllocator allocates per-channel PA (1:1) or
|
||
aggregated PA (n:1). `BaawSegmentInstallMsg` registers the segment
|
||
with PE_DMA.
|
||
2. **Use** (kernel run): kernel `tl.load(la_ptr)` → `DmaReadCmd
|
||
(src_addr=LA)`. PE_DMA's BAAW front-end looks up the segment and
|
||
converts to PA(s).
|
||
3. **Free** (tensor free): segment removed from table; LA and PA
|
||
returned.
|
||
|
||
#### D-LA4. BAAW resolution logic
|
||
|
||
BAAW is a front-end stage inside PE_DMA, not a separate SimPy
|
||
component. Synchronous address-resolution logic executed at the start
|
||
of PE_DMA's `handle_command()`.
|
||
|
||
Input: `(LA, nbytes)`. Output:
|
||
|
||
- **1:1 mode**: `list[PhysicalRequest]` — one per channel.
|
||
- **n:1 mode**: single `PhysicalRequest`.
|
||
|
||
```python
|
||
@dataclass
|
||
class PhysicalRequest:
|
||
pa: int # 51-bit Physical Address
|
||
nbytes: int # transfer size for this request
|
||
dst_node: str # target node_id (channel router or aggregated router)
|
||
|
||
|
||
def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
|
||
seg = self._find_segment(la) # la_base <= la < la_base + la_size
|
||
offset = la - seg.la_base
|
||
|
||
if seg.mode == "n_to_one":
|
||
pa = seg.agg_pa_base + offset
|
||
return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]
|
||
|
||
# one_to_one
|
||
requests = []
|
||
per_ch_size = seg.channel_size
|
||
for i, (pa_base, ch_id) in enumerate(zip(seg.pa_bases, seg.channel_ids)):
|
||
ch_offset = offset % per_ch_size
|
||
ch_nbytes = nbytes // seg.channel_count
|
||
pa = pa_base + ch_offset
|
||
dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
|
||
requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
|
||
return requests
|
||
```
|
||
|
||
BAAW responsibilities:
|
||
|
||
- Convert logical access → physical request units.
|
||
- Apply mode-dependent fan-out (1:1) or pass-through (n:1).
|
||
- Compute PA and target node.
|
||
|
||
BAAW non-responsibilities:
|
||
|
||
- Performing actual data movement.
|
||
- Executing NOC routing.
|
||
- Simulating bandwidth occupation (downstream components' job).
|
||
|
||
BAAW output is directly usable by the simulator's routing and resource
|
||
model without additional address decoding.
|
||
|
||
#### D-LA5. PE_DMA `handle_command()` change
|
||
|
||
Current (VA-based) flow:
|
||
|
||
```
|
||
DmaReadCmd.src_addr (VA)
|
||
→ MMU.translate(VA) → PA
|
||
→ PhysAddr.decode(PA) → PhysAddr object
|
||
→ resolver.resolve(PhysAddr) → dst_node_id
|
||
→ router.find_path(pe_prefix, dst_node_id) → path
|
||
→ 1 sub-Transaction → fabric inject
|
||
```
|
||
|
||
LA-based flow:
|
||
|
||
```
|
||
DmaReadCmd.src_addr (LA)
|
||
→ BAAW.resolve(LA, nbytes) → list[PhysicalRequest]
|
||
→ for each PhysicalRequest:
|
||
→ router.find_path(pe_prefix, req.dst_node) → path
|
||
→ compute_drain_ns(path, req.nbytes) → drain
|
||
→ sub-Transaction → fabric inject
|
||
→ await all sub-Transactions
|
||
→ pe_txn.done.succeed()
|
||
```
|
||
|
||
Key changes:
|
||
|
||
- MMU reference removed → BAAW resolve.
|
||
- `PhysAddr.decode()` + `resolver.resolve()` → BAAW returns `dst_node`
|
||
directly.
|
||
- 1 request → N parallel requests in 1:1 mode.
|
||
|
||
#### D-LA6. 1:1 mode detail
|
||
|
||
- One logical access → N physical requests (N = `channels_per_pe`).
|
||
- N = `hbm_pseudo_channels / pes_per_cube`.
|
||
- Each request: fully-resolved 51-bit PA, targets a specific channel
|
||
router (`{pe_prefix}.ch_r{channel_id}`).
|
||
- Per-channel link models BW contention.
|
||
- PE_DMA injects N sub-transactions concurrently.
|
||
|
||
Example: `hbm_pseudo_channels=64`, `pes_per_cube=8` → `channels_per_pe=8`.
|
||
PE0 owns ch0-7.
|
||
|
||
```text
|
||
Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
|
||
BAAW segment: {
|
||
la_base: 0x1_0000_0000, la_size: 4096,
|
||
mode: "one_to_one", channel_count: 8,
|
||
pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
|
||
channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
|
||
channel_size: 512,
|
||
}
|
||
|
||
BAAW resolve result (8 requests):
|
||
→ PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
|
||
→ PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
|
||
→ ...
|
||
→ PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")
|
||
|
||
PE_DMA: 8 sub-transactions parallel inject
|
||
per-channel router → hbm_ctrl link (channel_bw_gbs) per channel
|
||
Total effective BW = 8 × channel_bw_gbs
|
||
```
|
||
|
||
Other N values:
|
||
|
||
- `hbm_pseudo_channels=32`, `pes_per_cube=8` → `channels_per_pe=4`,
|
||
4 requests
|
||
- `hbm_pseudo_channels=64`, `pes_per_cube=4` → `channels_per_pe=16`,
|
||
16 requests
|
||
|
||
#### D-LA7. n:1 mode detail
|
||
|
||
- One logical access → one aggregated request.
|
||
- Target: aggregated router → hbm_ctrl (see ADR-0017 D8).
|
||
- Aggregated link BW = `channels_per_pe × channel_bw_gbs`
|
||
(e.g. 8 × 32 = 256 GB/s).
|
||
- Single queue / resource for modelling.
|
||
- No per-channel PA decomposition.
|
||
|
||
```text
|
||
Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
|
||
BAAW segment: {
|
||
la_base: 0x1_0000_0000, la_size: 4096,
|
||
mode: "n_to_one",
|
||
agg_pa_base: PA_agg,
|
||
agg_node_id: "sip0.cube0.pe0.agg_router",
|
||
}
|
||
|
||
BAAW resolve result:
|
||
→ PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")
|
||
|
||
PE_DMA: 1 sub-transaction
|
||
aggregated router → hbm_ctrl link (256 GB/s)
|
||
```
|
||
|
||
#### D-LA8. Kernel model preserved
|
||
|
||
- Kernel still issues single memory ops (`tl.load`, `tl.store`,
|
||
`tl.composite`).
|
||
- LA is the address scheme exposed to kernel code.
|
||
- Channel decomposition / aggregation happens inside PE_DMA's BAAW.
|
||
- Kernel code never sees physical channel information.
|
||
|
||
#### Consequences (LA model, proposed)
|
||
|
||
Positive:
|
||
|
||
- 1:1 vs n:1 semantics live in one place (BAAW).
|
||
- Kernel abstraction preserved — no kernel code changes.
|
||
- Topology-based policy control (mode switch via yaml).
|
||
- Improved simulation-model consistency and debuggability.
|
||
- Segment-based mapping is simpler than page tables; lower overhead.
|
||
|
||
Negative:
|
||
|
||
- Full VA/MMU code refactor required.
|
||
- Request-generation path more complex (N requests in 1:1 mode).
|
||
- Reduced per-channel visibility in n:1 mode.
|
||
- VA-related tests need rewriting.
|
||
|
||
---
|
||
|
||
## Migration Path
|
||
|
||
- **PA → VA** was an extension. PA mode is retained as the PageFault
|
||
fallback inside PE_DMA. Switching does not require removing PA
|
||
code.
|
||
- **VA → LA**, if adopted, is a replacement, not coexistence. See
|
||
D-LA1 for the VA infrastructure removal list. PA fallback inside
|
||
PE_DMA may be retained orthogonally for tests.
|
||
|
||
## Alternatives Considered (LA model)
|
||
|
||
1. **Keep VA + fan-out in MMU**: MMU returns per-channel PAs.
|
||
Rejected: MMU's role would grow beyond translation to request
|
||
decomposition; aggregation (n:1) becomes awkward to express.
|
||
2. **Channel-aware kernel API**: kernels call per-channel load/store
|
||
directly. Rejected: abstraction leakage, portability loss, all
|
||
benchmarks need rewriting.
|
||
3. **Always PA (no LA)**: runtime passes per-channel PA to kernel
|
||
directly. Rejected: incompatible with aggregation; conversion
|
||
timing unclear; channel info leaks to kernel.
|
||
|
||
## Test Requirements
|
||
|
||
### VA model (current, regression)
|
||
|
||
- Cross-PE / cross-cube DMA paths over installed mappings.
|
||
- `MmuMapMsg` / `MmuUnmapMsg` fabric traversal with measured latency.
|
||
- TLB-overhead-per-access timing.
|
||
- PageFault fallback path preserves PA-only behaviour.
|
||
|
||
### LA model (when implemented)
|
||
|
||
- 1:1 mode: same logical access → N per-channel requests.
|
||
- n:1 mode: same logical access → 1 aggregated request.
|
||
- Bandwidth equivalence between modes for identical workload.
|
||
- 1:1 mode: per-channel contention modelled correctly.
|
||
- n:1 mode: aggregated bandwidth correctly reflected.
|
||
- Kernel code unchanged across mode switch.
|
||
- BAAW segment install / uninstall correctness.
|
||
- Multiple tensors in distinct segments do not collide.
|
||
|
||
## Implementation Order (LA, when scheduled)
|
||
|
||
1. LA type (`policy/address/la_allocator.py`).
|
||
2. BAAW segment table (`policy/address/baaw.py`).
|
||
3. `BaawSegmentInstallMsg` (`runtime_api/kernel.py`).
|
||
4. PE_DMA BAAW integration (`components/builtin/pe_dma.py`
|
||
`handle_command()`).
|
||
5. RuntimeContext: LA alloc + segment install
|
||
(`runtime_api/context.py`).
|
||
6. `Tensor.va_base` → `Tensor.la_base` (`runtime_api/tensor.py`).
|
||
7. Remove VA/MMU code.
|
||
8. Remove `pe_mmu` from `topology.yaml`; add mapping mode settings.
|
||
9. Test migration:
|
||
|
||
| Test file | Action |
|
||
|-----------|--------|
|
||
| `tests/test_mmu_component.py` | Remove → BAAW segment install tests |
|
||
| `tests/test_mmu_fabric.py` | Remove → BAAW + fabric integration tests |
|
||
| `tests/test_pe_mmu.py` | Remove |
|
||
| `tests/test_va_allocator.py` | Replace with LA allocator tests |
|
||
| `tests/test_va_integration.py` | Replace with LA + BAAW integration tests |
|
||
| `tests/test_va_offset.py` | Replace with LA offset tests |
|
||
|
||
## Links
|
||
|
||
- ADR-0007 (runtime_api vs sim_engine boundaries)
|
||
- ADR-0008 (tensor deployment)
|
||
- ADR-0009 (kernel execution)
|
||
- ADR-0014 (PE-internal execution model)
|
||
- ADR-0015 (component port/wire model)
|
||
- ADR-0017 (Cube NOC and HBM connectivity — LA model topology consumer)
|
||
- ADR-0013 (Verification strategy — V1 PA tagging)
|
||
- SPEC R2 (latency by traversal), R10 (memory addressing)
|