# ADR-0048: Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator ## Status Accepted (2026-05-22). Pins down the free-list algorithm, page alignment, and coalescing rules used by `policy/address/allocator.py`'s `_FreeList` / `PEMemAllocator` and `va_allocator.py`'s `VirtualAllocator`. ADR-0001 (PhysAddr layout) and ADR-0011 (PA/VA/LA models) define the address schemes; the **allocation algorithms** had no ADR-level coverage. ## First action ### `_FreeList(capacity)` On construction: `self._capacity = capacity`, `self._used = 0`, `self._free = [(0, capacity)]`. The first act is **establishing the entire region as one free block** — the tuple `(offset=0, size=capacity)` is the sole entry in the free list. ### `PEMemAllocator(sip_id, die_id, pe_id, cfg)` On construction, builds two `_FreeList`s: - `self._hbm = _FreeList(cfg.hbm_slice_bytes)` — the size of this PE's HBM slice (`hbm_bytes_per_cube // hbm_slices_per_cube`). - `self._tcm = _FreeList(cfg.tcm_allocatable_bytes)` — equals `tcm_bytes_per_pe - tcm_scheduler_reserved_bytes` (the scheduler reservation is pre-deducted). So PEMemAllocator's first act is **constructing single-free-block HBM-slice and TCM regions for this PE**. ### `VirtualAllocator(va_base, va_size, page_size=2*1024*1024)` On construction: `self._va_base = va_base`, `self._va_size = va_size`, `self._page_size = page_size`, `self._used = 0`, `self._free = [(va_base, va_size)]`. The first act is **establishing one block from va_base to va_size and stashing page_size**. ## Context `runtime_api/context.py::_ensure_allocators` builds the allocator set in these stages: 1. Read `hbm_total_gb_per_cube`, `hbm_slices_per_cube`, `tcm_size_mb`, per-target_device SIP range, etc. from `spec`. 2. Pack everything into a frozen `AddressConfig`. 3. For every combination in the target SIP range × cubes × PEs, construct one `PEMemAllocator(sip, cube, pe, cfg)` instance. 4. Construct one `VirtualAllocator(va_base=0x1_0000_0000, va_size=64 GiB, page_size=pe_mmu.page_size)`. Allocator responsibilities: - **PEMemAllocator**: PA-space allocation in the PE-local HBM slice / TCM (including PhysAddr encoding). - **VirtualAllocator**: device-wide VA allocation, page-aligned. `RuntimeContext._create_tensor` then pushes VA → PA mappings to components via `MmuMapMsg`. These algorithms are: - **First-fit**, kept simple. - The free-block list is **sorted by start offset**. - On `free()`, **adjacent blocks coalesce**. The rationale was not documented anywhere, so when someone asks "why not best-fit?", "why not a buddy allocator?", "why does partial-overlap free pass silently?", there was no anchor to answer from. This ADR provides it. ## Decision ### D1. `_FreeList` — offset-keyed first-fit + coalescing `policy/address/allocator.py::_FreeList`: - Internal representation: `list[tuple[int, int]] = [(start_offset, size), ...]` — sorted by start offset. - `alloc(nbytes)`: 1. Iterate the free list from the front (first-fit). 2. Take from the first block with `size >= nbytes`. 3. Exact match → drop the block; otherwise shrink it to `(start + nbytes, size - nbytes)`. 4. `_used += nbytes`; return the taken `start`. 5. If no block fits, `AllocationError("overflow ... largest free block ...")`. - `free(offset, nbytes)`: 1. `_used -= nbytes`. 2. `bisect_left(self._free, (offset,))` finds the insertion index. 3. If adjacent to the previous block (`prev_start + prev_size == offset`), merge. 4. If adjacent to the next block (`offset + nbytes == next_start`), merge. 5. Insert the coalesced range at the right sorted position. This algorithm is weaker than best-fit / buddy on fragmentation, but the simulator's workload (mostly stack-like deploy/free) tolerates it. If the workload shape changes, D1 is a supersession candidate. ### D2. Partial-overlap free is **not** validated `_FreeList.free(offset, nbytes)` trusts the caller to pass the exact `(offset, nbytes)`. It does **not** verify: - That the range was actually allocated. - That the range does not overlap another allocated region. Reason: in a simulator context, callers always store the return value of `alloc()` and pass it back to `free()` — there is no external user input. Adding a safety check would cost O(N) per free and impact simulation wall-clock. If this trust model breaks (e.g., a code path lets two tensors point at the same PA), this ADR must be revisited. ### D3. `PEMemAllocator` — two channels for HBM/TCM `PEMemAllocator(sip_id, die_id, pe_id, cfg)` holds two `_FreeList`s: - `_hbm`: size `cfg.hbm_slice_bytes`. - `_tcm`: size `cfg.tcm_allocatable_bytes` (= `tcm_bytes_per_pe - tcm_scheduler_reserved_bytes`). `alloc_hbm(nbytes) -> PhysAddr`: - `_hbm.alloc(nbytes)` → offset. - `PhysAddr.pe_hbm_addr(sip_id, die_id, pe_id, pe_local_hbm_offset=offset, slice_size_bytes=cfg.hbm_slice_bytes)`. - Failure raises `AllocationError("HBM overflow ...")`. `free_hbm(pa, nbytes)`: - Recover PE-local offset via `pa.hbm_offset - pe_id * cfg.hbm_slice_bytes`. - `_hbm.free(offset, nbytes)`. `alloc_tcm(nbytes) -> PhysAddr`: similar; uses `PhysAddr.pe_tcm_addr`. `free_tcm(pa, nbytes)`: uses `pa.sub_offset` directly (TCM's PE-local offset equals its sub_offset). The allocator does not see the scheduler-reserved TCM region (`cfg.tcm_scheduler_reserved_bytes`) — it is pre-subtracted from the `_tcm` capacity. This is consistent with ADR-0014's PE_SCHEDULER internal-buffer reservation. ### D4. `VirtualAllocator` — page-aligned first-fit + coalescing `policy/address/va_allocator.py::VirtualAllocator`: - Internal representation: same sorted `list[tuple[int, int]]` as `_FreeList`. Initially `[(va_base, va_size)]`. - `_align_up(nbytes) = ceil(nbytes / page_size) * page_size`. - `alloc(nbytes) -> int`: 1. `aligned = _align_up(nbytes)`. 2. First-fit a block with `size >= aligned`. 3. Take `aligned` from the block's front; remove if exact. 4. `_used += aligned`. Return the block's `start` (which is page- aligned). 5. Failure → `VaAllocationError`. - `free(va, nbytes)`: free `_align_up(nbytes)` worth. Coalesces with the same algorithm as `_FreeList`. `page_size` has different defaults in two places: - `VirtualAllocator.__init__`'s parameter default: `2 MiB`. Direct-call tests receive this. - `RuntimeContext._ensure_allocators` when constructing the instance: `pe_mmu.attrs.get("page_size", 4096)` — uses `topology.yaml`'s `pe_mmu.attrs.page_size` if set, else falls back to `4 KiB`. The two defaults differ on purpose: `VirtualAllocator`'s standalone default (`2 MiB`) aligns with ADR-0039's PE_MMU stopgap default for direct-test ergonomics; the context fallback (`4 KiB`) is the safe minimum when `topology.yaml` doesn't specify a page size. The production path is always the latter (via `_ensure_allocators`), and when `topology.yaml` sets `page_size`, that value flows consistently into both the MMU and the VA allocator. If consistency breaks (e.g., VirtualAllocator instantiated with a page_size different from PE_MMU's), MMU `map()` falls into the sub-page region mode (ADR-0039 D3). VA range defaults: `va_base = 0x1_0000_0000` (= 4 GiB), `va_size = 64 GiB`. These are hardcoded in `_ensure_allocators` and have no semantic meaning in ADR-0011's VA model — they simply reserve enough device-wide space without colliding with host code. ### D5. Lifecycle of allocator instances - `RuntimeContext._ensure_allocators` is lazy — called on the first `_create_tensor`. - The allocator dict (`self._allocators`) lives for the RuntimeContext's lifetime. A second deploy in the same process does not construct new objects. - `RuntimeContext.cleanup()` walks living tensors and calls `_free_tensor()`, which issues MMU unmaps + `va_allocator.free` + `pemem_allocator.free_hbm` — restoring the free lists. A subsequent RuntimeContext starts fresh. This per-RuntimeContext isolation guarantees deterministic deploy → cleanup → deploy sequences within a single process. ### D6. Allocator failure raises (no silent OOM) Both `_FreeList.alloc` and `VirtualAllocator.alloc` raise `AllocationError` / `VaAllocationError` when no block fits. The message includes "required size + largest available block" to distinguish fragmentation from true OOM. A silent fallback (e.g., allocating only as much as the largest free block) is strictly forbidden — a partially-allocated tensor reaching SimPy would cause routing / DMA to see incorrect PAs and silently corrupt simulation results. ### D7. One allocator per address space Physical address spaces are separated by PhysAddr sub-units (ADR-0001 D2.3); each sub-unit gets its own allocator instance: - HBM slice → `PEMemAllocator._hbm`. - PE TCM → `PEMemAllocator._tcm`. - (Currently unused) M_CPU local memory, CUBE SRAM → would need their own allocators. Today these are handled as IPCQ-only slots (ADR-0023 D9.7) and do not share PA space, so no free-list exists for them. When a cube-level SRAM allocator is needed, `_FreeList(cfg.sram_bytes_per_cube)` is added per-cube (`cfg.sram_bytes_per_cube` is already defined in `AddressConfig` — the data model is ready). ## Alternatives Considered ### A1. Best-fit / buddy allocator Rejected (currently). The workload's alloc/free pattern is stack-like (deploy order ≈ free order), so first-fit + coalescing controls fragmentation well enough. If long-running fragmentation appears in LLM kernel sweeps, a buddy-allocator ADR will replace D1. ### A2. Add partial-overlap free validation Rejected. D2's trust model plus the O(N) per-free cost makes this unattractive. A debug mode (e.g., `KERNBENCH_DEBUG` env var) that enables the check could be added later. ### A3. A unified allocator for VA and PA Rejected. VA space (64 GiB device-wide) and PA space (per-slice ~6 GiB) have different semantic dimensions — VA is the kernel's view, PA is the device sub-unit's view. ADR-0011's VA model (MMU maps between the two) calls for separated allocators. ### A4. Multi-tier page sizes (large pages + small pages) Rejected (currently). A single page size (2 MiB) matches LLM kernel tensor sizes (a few MiB to GiB); smaller mappings are absorbed by ADR-0039 D3's sub-page region mode. Multi-tier paging would require extending the MMU model itself — a separate ADR candidate. ## Consequences - The allocator algorithm is pinned at ADR level (D1, D3, D4), so any future simulation scenario hitting fragmentation has a clear "we're using first-fit + coalescing" anchor to inspect. - D2's trust model is explicit, so any future code path that exposes alloc/free to direct user input will trigger this ADR's supersession early. - D7's one-allocator-per-sub-unit mapping is on record, so when M_CPU or SRAM need their own free-list, the addition point is obvious. - D4 captures the page_size dual-default and its production path (`_ensure_allocators` always wins), letting future `topology.yaml` `page_size` changes be assessed against ADR-0039's stopgap interaction quickly.