ADR: introduce docs/history/, merge 0011+0018, prune migration cruft

- CLAUDE.md: add ADR Lifecycle subsection (superseded → docs/history/,
  immutable numbering, no renumber)
- ADR-0011: merge ADR-0018 content as "Address Model: LA" section
  alongside PA / VA; status notes VA model is currently implemented
- ADR-0018 / 0029 / 0031: moved to docs/history/ with status updates
  (0018 merged into 0011, 0029 superseded by 0032, 0031 absorbed
  into 0001 rev 2)
- ADR-0019: rewrite Context as PE-HBM connectivity decision
  (self-contained, no LA model framing)
- ADR-0019/0020/0021/0023/0025/0027: Status Proposed → Accepted
  (code verified) and prune Implementation Notes / Affected files /
  Test strategy / "현재 상태" sub-sections describing pre-impl state
- ADR-0024/0026: same migration-flavor cleanup; 0026 also drops D6
  Migration and D8 docs-update sub-decisions
- ADR-0030: status simplified (blocker ADR-0031 now superseded)
- SPEC.md: R10 + §0.2 reflect PA / VA / LA model names
- ADR-0008/0012/0013: refresh ADR-0011 subtitle in Links

21 files changed, 553 insertions(+), 1290 deletions(-).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-19 11:42:45 -07:00
parent ecc57d050d
commit 22fd0d2b9d
23 changed files with 553 additions and 1290 deletions
@@ -1,95 +1,514 @@
# ADR-0011: Memory Addressing — PA-first with VA/MMU Extension
# ADR-0011: Memory Addressing — PA / VA / LA Address Models
## Status
Accepted (Phase 1 VA/MMU implemented)
Accepted.
- **VA model: currently implemented (default).**
- PA model: implemented as PageFault fallback in PE_DMA.
- LA model: proposed, not implemented.
## Context
A realistic system uses host-side virtual addressing and an MMU/IOMMU-style
translation path for DMA: host allocates physical memory at PE level, maps it
into a virtual address space, installs mappings, and DMA requests use virtual
addresses that are translated to physical addresses.
KernBench's address model evolved through three design points, each
addressing a limitation of the previous. This ADR documents all three
in one place because future implementation work selects among them.
The PA-only model (Phase 0) was insufficient for running standard Triton kernels
that use `base_addr + offset` patterns on sharded tensors — each PE's shard has
a different PA, but the kernel needs a single contiguous address space.
### PA-only baseline
Phase 0 of KernBench treated all device memory operations
(MemoryRead/MemoryWrite) as raw physical-address transfers. No
host-side virtual addressing, no MMU/IOMMU translation. Allocators
returned PA mappings; DMA requests carried PA directly.
This was sufficient for early correctness/latency work but
insufficient for running standard Triton kernels that use
`base_addr + offset` patterns on sharded tensors: each PE's shard
has a different PA, but the kernel needs a single contiguous address
space to compute offsets.
### Why VA/MMU (current default)
A realistic system uses host-side virtual addressing and an
MMU/IOMMU-style translation path for DMA: the host allocates physical
memory at PE level, maps it into a virtual address space, installs
mappings, and DMA requests use virtual addresses that are translated
to physical addresses.
Adopting this model lets kernels use `base_addr + offset` over a
contiguous VA range while the device-side MMU translates each access
to the appropriate PA.
### Why LA/BAAW (proposed)
VA/MMU treats HBM as a single backing space. KernBench needs to
explore architectures where HBM is composed of multiple pseudo
channels in parallel:
- CUBE's HBM has 32 or 64 pseudo channels.
- In a PE-Local-HBM model, each PE is assigned N pseudo channels
(N = `hbm_pseudo_channels / pes_per_cube`).
- Per-channel BW (e.g. 32 GB/s) determines aggregate PE BW
(N × per-channel).
Two channel-mapping modes need to be modelable:
- **1:1 mode** — one logical access → N per-channel requests.
Precise per-channel BW contention modelling.
- **n:1 mode (default)** — one logical access → one aggregated
request. Channels are assumed to interleave; aggregated BW model.
VA's `tl.load(va_ptr)` produces a single DMA request to a single
target. Decomposing that into per-channel requests inside PE_DMA
requires the address layer to be aware of channels. This is the
role of the LA (Logical Address) abstraction with BAAW
(Logical-to-Physical Mapping Unit).
Core requirements driving the LA design:
- PE_DMA → HBM_CTRL effective bandwidth semantics must be identical
in both modes (only request shape and resource model differ).
- Kernel programming model is unchanged — physical channel
information is never exposed to kernel code.
- Mode switch is a topology-level configuration.
### Design space summary
| Model | Status | Key idea |
|-------|--------|----------|
| PA | fallback (implemented) | Direct physical addressing, no translation |
| VA | current default (implemented) | Per-tensor contiguous VA range; MMU translates per access |
| LA | proposed | LA + BAAW resolves to (PA, channel); supports 1:1 and n:1 channel mapping modes |
---
## Decision
### D1. Phase 0 model is PA-only (original, retained as fallback)
This ADR defines three address models. At any given time the system
operates in exactly one model. Selection is topology- / configuration-
driven; coexistence within one simulation run is not required.
- All device memory accesses (MemoryRead/MemoryWrite) operate on device physical
addresses (PA) plus size.
- PA-only mode remains functional via PageFault fallback in PE_DMA.
---
### D2. Allocation produces PA mappings
### Address Model: PA (Physical Address) — fallback
Device allocation selects PE-local memory regions and returns PA mappings
sufficient to execute kernels and issue DMA requests.
#### D-PA1. PA-only semantics
### D3. Phase 1: VA/MMU layer (implemented)
- All device memory accesses (MemoryRead/MemoryWrite) operate on
device physical addresses (PA) plus size.
- PA-only mode remains functional via the PageFault fallback path in
PE_DMA: if a DMA src/dst address has no MMU mapping, PE_DMA treats
the value as a PA directly.
#### D3.1 Virtual Address Model
#### D-PA2. Allocation produces PA mappings
Device allocation selects PE-local memory regions and returns PA
mappings sufficient to execute kernels and issue DMA requests.
PA model is retained primarily for backward compatibility with PA-only
tests and as the underlying physical layer that VA / LA models resolve
into.
---
### Address Model: VA (Virtual Address with MMU) — current default
#### D-VA1. Virtual Address Model
- Each tensor gets a single contiguous VA range (`TensorHandle.va_base`).
- `TensorShard` does NOT carry a `va` field — shard VA is derived as
`va_base + offset_bytes`.
- Kernels receive `va_base` as their pointer argument (via `TensorArg.va_base`).
- Kernels receive `va_base` as their pointer argument (via
`TensorArg.va_base`).
- `DmaReadCmd.src_addr` and `DmaWriteCmd.dst_addr` carry VA (not PA).
#### D3.2 PE_MMU Component
#### D-VA2. PE_MMU Component
- Hybrid design: SimPy component (inbox for MmuMapMsg) + utility (synchronous
`translate()` called by PE_DMA).
- Page-aligned dict lookup for O(1) VAPA translation.
- Hybrid design: SimPy component (inbox for `MmuMapMsg`) + utility
(synchronous `translate()` called by PE_DMA).
- Page-aligned dict lookup for O(1) VAPA translation.
- `tlb_overhead_ns` configurable per-access latency.
- PageFault fallback: if VA has no mapping, PE_DMA treats it as PA directly
(backward compatibility with PA-only tests).
- PageFault fallback: if VA has no mapping, PE_DMA treats it as PA
directly (preserves PA model for backward compatibility).
#### D3.3 Mapping Installation
#### D-VA3. Mapping Installation
- `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube fan-out) →
M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured end-to-end.
- `MmuMapMsg.target_sips` controls SIP-level routing to prevent cross-SIP
mapping contamination for replicated tensors.
- `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube
fan-out) → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured
end-to-end.
- `MmuMapMsg.target_sips` controls SIP-level routing to prevent
cross-SIP mapping contamination for replicated tensors.
- Mapping strategy based on `DPPolicy.cube`:
- **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping only.
Each cube's PEs see only their local PA. No cross-cube mapping installed.
- **Sharded** (`cube="column_wise"`, etc.): broadcast all shard mappings to all
target cubes. Enables cross-PE and cross-cube DMA.
- **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping
only. Each cube's PEs see only their local PA. No cross-cube
mapping installed.
- **Sharded** (`cube="column_wise"`, etc.): broadcast all shard
mappings to all target cubes. Enables cross-PE and cross-cube
DMA.
#### D3.4 Tensor Lifecycle
#### D-VA4. Tensor Lifecycle
- `del tensor` triggers automatic cleanup via `Tensor.__del__` + `weakref` to
RuntimeContext. Sends `MmuUnmapMsg` through fabric, returns VA and PA space.
- `del tensor` triggers automatic cleanup via `Tensor.__del__` +
`weakref` to `RuntimeContext`. Sends `MmuUnmapMsg` through fabric,
returns VA and PA space.
- `with RuntimeContext(...) as ctx:` provides scope-based bulk cleanup.
- `RuntimeContext._tensors` uses `weakref.ref` to avoid preventing GC.
- `PEMemAllocator` uses free-list with coalescing (not bump allocator).
- `VirtualAllocator` uses free-list with coalescing for VA space.
#### D3.5 Allocators
#### D-VA5. Allocators
- `VirtualAllocator`: device-wide VA space, page-aligned alloc/free with
- `VirtualAllocator`: device-wide VA space, page-aligned alloc/free
with coalescing.
- `PEMemAllocator`: per-PE HBM/TCM, free-list based alloc/free with
coalescing.
- `PEMemAllocator`: per-PE HBM/TCM, free-list based alloc/free with coalescing.
- Page size configurable via `topology.yaml` pe_mmu attrs (default 4096).
- Page size configurable via `topology.yaml` `pe_mmu` attrs
(default 4096).
---
#### Consequences (VA model)
## Consequences
- Triton kernels use `base_addr + offset` patterns naturally on sharded tensors.
- All latency remains explicit via graph traversal, including MMU mapping
installation and per-access TLB overhead.
- Triton kernels use `base_addr + offset` patterns naturally on
sharded tensors.
- All latency remains explicit via graph traversal, including MMU
mapping installation and per-access TLB overhead.
- PA-only mode retained as fallback (PageFault → treat as PA).
- Benchmark parameter renamed `ctx``torch` for PyTorch code compatibility.
- IPCQ and other fixed-address resources bypass MMU (use PA directly).
---
### Address Model: LA (Logical Address with BAAW) — proposed
LA replaces VA when channel-level HBM modelling is required.
Adopting this model removes the VA/MMU infrastructure (D-LA1 lists the
removed artifacts). Coexistence with VA in the same run is not a goal.
#### D-LA1. LA introduction — replaces VA infrastructure
LA is the sole address space used by kernel code (`tl.load`,
`tl.store`, `tl.composite`). Properties:
- Can map a Tensor to a contiguous logical space (like VA).
- Expresses `(logical buffer + offset)`.
- Does NOT contain physical channel information directly.
- Stays as an intermediate abstraction until physical resolution.
LA address space:
| Item | Value |
|------|-------|
| LA start | `0x1_0000_0000` (4 GB, preserves former VA start) |
| LA space size | 64 GB per PE |
| Alignment unit | segment (see D-LA3) |
LA is PE-local: different PEs may use the same LA value; BAAW segment
tables differ → they resolve to different PAs.
VA infrastructure removed when LA is adopted:
| Removed | Replacement |
|---------|-------------|
| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, renamed) |
| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component |
| `runtime_api/kernel.py`: `MmuMapMsg`, `MmuUnmapMsg` | `BaawSegmentInstallMsg` |
| `runtime_api/context.py`: VA alloc + MMU install | LA alloc + BAAW segment install |
| `runtime_api/tensor.py`: `va_base` | `la_base` |
| `topology.yaml`: `pe_mmu` component entry | Removed |
#### D-LA2. Mapping mode setting
Topology-level (cube) configuration:
```yaml
cube:
memory_map:
hbm_mapping_mode: n_to_one # one_to_one | n_to_one
hbm_pseudo_channels: 64 # total pseudo channel count
hbm_channels_per_pe: 8 # per-PE local channel count
hbm_channel_bw_gbs: 32.0 # per-channel bandwidth
```
Consumed by the graph compiler (topology builder) and BAAW
initialisation.
#### D-LA3. Segment and BAAW
Segment partitions the LA space; each segment maps to a specific HBM
channel or channel group. Created at tensor deploy time by the runtime
allocator. BAAW resolves LA → physical request(s) using the segment
table.
```python
@dataclass
class BaawSegment:
la_base: int # segment start LA
la_size: int # segment size (bytes)
mode: str # "one_to_one" | "n_to_one"
# 1:1 mode fields
channel_count: int # channels assigned to this segment (e.g. 8)
pa_bases: list[int] # per-channel PA bases (len = channel_count)
channel_ids: list[int] # per-channel logical IDs (e.g. [0..7])
channel_size: int # per-channel size (la_size // channel_count)
# n:1 mode fields
agg_pa_base: int # aggregated PA base
agg_node_id: str # aggregated router node_id
```
Segment lifecycle:
1. **Allocate** (tensor deploy): RuntimeContext allocates LA from LA
allocator. PEMemAllocator allocates per-channel PA (1:1) or
aggregated PA (n:1). `BaawSegmentInstallMsg` registers the segment
with PE_DMA.
2. **Use** (kernel run): kernel `tl.load(la_ptr)` → `DmaReadCmd
(src_addr=LA)`. PE_DMA's BAAW front-end looks up the segment and
converts to PA(s).
3. **Free** (tensor free): segment removed from table; LA and PA
returned.
#### D-LA4. BAAW resolution logic
BAAW is a front-end stage inside PE_DMA, not a separate SimPy
component. Synchronous address-resolution logic executed at the start
of PE_DMA's `handle_command()`.
Input: `(LA, nbytes)`. Output:
- **1:1 mode**: `list[PhysicalRequest]` — one per channel.
- **n:1 mode**: single `PhysicalRequest`.
```python
@dataclass
class PhysicalRequest:
pa: int # 51-bit Physical Address
nbytes: int # transfer size for this request
dst_node: str # target node_id (channel router or aggregated router)
def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
seg = self._find_segment(la) # la_base <= la < la_base + la_size
offset = la - seg.la_base
if seg.mode == "n_to_one":
pa = seg.agg_pa_base + offset
return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]
# one_to_one
requests = []
per_ch_size = seg.channel_size
for i, (pa_base, ch_id) in enumerate(zip(seg.pa_bases, seg.channel_ids)):
ch_offset = offset % per_ch_size
ch_nbytes = nbytes // seg.channel_count
pa = pa_base + ch_offset
dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
return requests
```
BAAW responsibilities:
- Convert logical access → physical request units.
- Apply mode-dependent fan-out (1:1) or pass-through (n:1).
- Compute PA and target node.
BAAW non-responsibilities:
- Performing actual data movement.
- Executing NOC routing.
- Simulating bandwidth occupation (downstream components' job).
BAAW output is directly usable by the simulator's routing and resource
model without additional address decoding.
#### D-LA5. PE_DMA `handle_command()` change
Current (VA-based) flow:
```
DmaReadCmd.src_addr (VA)
→ MMU.translate(VA) → PA
→ PhysAddr.decode(PA) → PhysAddr object
→ resolver.resolve(PhysAddr) → dst_node_id
→ router.find_path(pe_prefix, dst_node_id) → path
→ 1 sub-Transaction → fabric inject
```
LA-based flow:
```
DmaReadCmd.src_addr (LA)
→ BAAW.resolve(LA, nbytes) → list[PhysicalRequest]
→ for each PhysicalRequest:
→ router.find_path(pe_prefix, req.dst_node) → path
→ compute_drain_ns(path, req.nbytes) → drain
→ sub-Transaction → fabric inject
→ await all sub-Transactions
→ pe_txn.done.succeed()
```
Key changes:
- MMU reference removed → BAAW resolve.
- `PhysAddr.decode()` + `resolver.resolve()` → BAAW returns `dst_node`
directly.
- 1 request → N parallel requests in 1:1 mode.
#### D-LA6. 1:1 mode detail
- One logical access → N physical requests (N = `channels_per_pe`).
- N = `hbm_pseudo_channels / pes_per_cube`.
- Each request: fully-resolved 51-bit PA, targets a specific channel
router (`{pe_prefix}.ch_r{channel_id}`).
- Per-channel link models BW contention.
- PE_DMA injects N sub-transactions concurrently.
Example: `hbm_pseudo_channels=64`, `pes_per_cube=8` → `channels_per_pe=8`.
PE0 owns ch0-7.
```text
Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
la_base: 0x1_0000_0000, la_size: 4096,
mode: "one_to_one", channel_count: 8,
pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
channel_size: 512,
}
BAAW resolve result (8 requests):
→ PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
→ PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
→ ...
→ PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")
PE_DMA: 8 sub-transactions parallel inject
per-channel router → hbm_ctrl link (channel_bw_gbs) per channel
Total effective BW = 8 × channel_bw_gbs
```
Other N values:
- `hbm_pseudo_channels=32`, `pes_per_cube=8` → `channels_per_pe=4`,
4 requests
- `hbm_pseudo_channels=64`, `pes_per_cube=4` → `channels_per_pe=16`,
16 requests
#### D-LA7. n:1 mode detail
- One logical access → one aggregated request.
- Target: aggregated router → hbm_ctrl (see ADR-0019).
- Aggregated link BW = `channels_per_pe × channel_bw_gbs`
(e.g. 8 × 32 = 256 GB/s).
- Single queue / resource for modelling.
- No per-channel PA decomposition.
```text
Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
la_base: 0x1_0000_0000, la_size: 4096,
mode: "n_to_one",
agg_pa_base: PA_agg,
agg_node_id: "sip0.cube0.pe0.agg_router",
}
BAAW resolve result:
→ PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")
PE_DMA: 1 sub-transaction
aggregated router → hbm_ctrl link (256 GB/s)
```
#### D-LA8. Kernel model preserved
- Kernel still issues single memory ops (`tl.load`, `tl.store`,
`tl.composite`).
- LA is the address scheme exposed to kernel code.
- Channel decomposition / aggregation happens inside PE_DMA's BAAW.
- Kernel code never sees physical channel information.
#### Consequences (LA model, proposed)
Positive:
- 1:1 vs n:1 semantics live in one place (BAAW).
- Kernel abstraction preserved — no kernel code changes.
- Topology-based policy control (mode switch via yaml).
- Improved simulation-model consistency and debuggability.
- Segment-based mapping is simpler than page tables; lower overhead.
Negative:
- Full VA/MMU code refactor required.
- Request-generation path more complex (N requests in 1:1 mode).
- Reduced per-channel visibility in n:1 mode.
- VA-related tests need rewriting.
---
## Migration Path
- **PA → VA** was an extension. PA mode is retained as the PageFault
fallback inside PE_DMA. Switching does not require removing PA
code.
- **VA → LA**, if adopted, is a replacement, not coexistence. See
D-LA1 for the VA infrastructure removal list. PA fallback inside
PE_DMA may be retained orthogonally for tests.
## Alternatives Considered (LA model)
1. **Keep VA + fan-out in MMU**: MMU returns per-channel PAs.
Rejected: MMU's role would grow beyond translation to request
decomposition; aggregation (n:1) becomes awkward to express.
2. **Channel-aware kernel API**: kernels call per-channel load/store
directly. Rejected: abstraction leakage, portability loss, all
benchmarks need rewriting.
3. **Always PA (no LA)**: runtime passes per-channel PA to kernel
directly. Rejected: incompatible with aggregation; conversion
timing unclear; channel info leaks to kernel.
## Test Requirements
### VA model (current, regression)
- Cross-PE / cross-cube DMA paths over installed mappings.
- `MmuMapMsg` / `MmuUnmapMsg` fabric traversal with measured latency.
- TLB-overhead-per-access timing.
- PageFault fallback path preserves PA-only behaviour.
### LA model (when implemented)
- 1:1 mode: same logical access → N per-channel requests.
- n:1 mode: same logical access → 1 aggregated request.
- Bandwidth equivalence between modes for identical workload.
- 1:1 mode: per-channel contention modelled correctly.
- n:1 mode: aggregated bandwidth correctly reflected.
- Kernel code unchanged across mode switch.
- BAAW segment install / uninstall correctness.
- Multiple tensors in distinct segments do not collide.
## Implementation Order (LA, when scheduled)
1. LA type (`policy/address/la_allocator.py`).
2. BAAW segment table (`policy/address/baaw.py`).
3. `BaawSegmentInstallMsg` (`runtime_api/kernel.py`).
4. PE_DMA BAAW integration (`components/builtin/pe_dma.py`
`handle_command()`).
5. RuntimeContext: LA alloc + segment install
(`runtime_api/context.py`).
6. `Tensor.va_base` → `Tensor.la_base` (`runtime_api/tensor.py`).
7. Remove VA/MMU code.
8. Remove `pe_mmu` from `topology.yaml`; add mapping mode settings.
9. Test migration:
| Test file | Action |
|-----------|--------|
| `tests/test_mmu_component.py` | Remove → BAAW segment install tests |
| `tests/test_mmu_fabric.py` | Remove → BAAW + fabric integration tests |
| `tests/test_pe_mmu.py` | Remove |
| `tests/test_va_allocator.py` | Replace with LA allocator tests |
| `tests/test_va_integration.py` | Replace with LA + BAAW integration tests |
| `tests/test_va_offset.py` | Replace with LA offset tests |
## Links
- ADR-0007 (runtime_api vs sim_engine boundaries)
@@ -97,4 +516,6 @@ sufficient to execute kernels and issue DMA requests.
- ADR-0009 (kernel execution)
- ADR-0014 (PE-internal execution model)
- ADR-0015 (component port/wire model)
- SPEC R2 (latency by traversal)
- ADR-0019 (NOC + per-channel HBM connectivity — LA model topology
consumer)
- SPEC R2 (latency by traversal), R10 (memory addressing)