Files
kernbench2/docs/adr/ADR-0048-mem-allocator-algorithms.md
ywkang 9a02955770 adr: add ADR-0046-0049 — close G4 coverage gaps from /report
Documents four cross-cutting surfaces that previously had no ADR backing,
each surfaced as a G4 candidate by /report:

- 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates
  all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...),
  the two execution modes (command-list vs greenlet runner), scratch
  allocator semantics, dispatch-overhead model, and the kernel registry.

- 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group
  (backend="ahbm") install path. world_size priority (algorithm >
  defaults > topology), the 4-step init sequence (load ccl.yaml, import
  algorithm module, derive world_size, install SFR + IPCQ), greenlet-
  local rank registry, all_reduce dispatch via _defer_wait, barrier
  no-op rationale, and the explicit list of unsupported dist.* APIs.

- 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator
  free-list semantics. Offset-keyed first-fit with coalescing, the
  no-validation trust model for free(), HBM/TCM channel separation,
  page-aligned VA allocation, the page_size dual-default
  (VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and
  one-allocator-per-sub-unit rule.

- 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog.
  H2D / D2H / PE DMA categories with their exact cube-index choices,
  the 32 KiB reference size, the 5-point utilization sweep, the
  formula vs actual column meanings, automatic invariant checks
  (monotonicity, D2H >= H2D, best < worst), per-case GraphEngine
  isolation, and the human-readable (not machine-parsable) output
  contract.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:25:04 -07:00

11 KiB
Raw Permalink Blame History

ADR-0048: Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator

Status

Accepted (2026-05-22).

Pins down the free-list algorithm, page alignment, and coalescing rules used by policy/address/allocator.py's _FreeList / PEMemAllocator and va_allocator.py's VirtualAllocator. ADR-0001 (PhysAddr layout) and ADR-0011 (PA/VA/LA models) define the address schemes; the allocation algorithms had no ADR-level coverage.

First action

_FreeList(capacity)

On construction: self._capacity = capacity, self._used = 0, self._free = [(0, capacity)]. The first act is establishing the entire region as one free block — the tuple (offset=0, size=capacity) is the sole entry in the free list.

PEMemAllocator(sip_id, die_id, pe_id, cfg)

On construction, builds two _FreeLists:

  • self._hbm = _FreeList(cfg.hbm_slice_bytes) — the size of this PE's HBM slice (hbm_bytes_per_cube // hbm_slices_per_cube).
  • self._tcm = _FreeList(cfg.tcm_allocatable_bytes) — equals tcm_bytes_per_pe - tcm_scheduler_reserved_bytes (the scheduler reservation is pre-deducted).

So PEMemAllocator's first act is constructing single-free-block HBM-slice and TCM regions for this PE.

VirtualAllocator(va_base, va_size, page_size=2*1024*1024)

On construction: self._va_base = va_base, self._va_size = va_size, self._page_size = page_size, self._used = 0, self._free = [(va_base, va_size)]. The first act is establishing one block from va_base to va_size and stashing page_size.

Context

runtime_api/context.py::_ensure_allocators builds the allocator set in these stages:

  1. Read hbm_total_gb_per_cube, hbm_slices_per_cube, tcm_size_mb, per-target_device SIP range, etc. from spec.
  2. Pack everything into a frozen AddressConfig.
  3. For every combination in the target SIP range × cubes × PEs, construct one PEMemAllocator(sip, cube, pe, cfg) instance.
  4. Construct one VirtualAllocator(va_base=0x1_0000_0000, va_size=64 GiB, page_size=pe_mmu.page_size).

Allocator responsibilities:

  • PEMemAllocator: PA-space allocation in the PE-local HBM slice / TCM (including PhysAddr encoding).
  • VirtualAllocator: device-wide VA allocation, page-aligned. RuntimeContext._create_tensor then pushes VA → PA mappings to components via MmuMapMsg.

These algorithms are:

  • First-fit, kept simple.
  • The free-block list is sorted by start offset.
  • On free(), adjacent blocks coalesce.

The rationale was not documented anywhere, so when someone asks "why not best-fit?", "why not a buddy allocator?", "why does partial-overlap free pass silently?", there was no anchor to answer from. This ADR provides it.

Decision

D1. _FreeList — offset-keyed first-fit + coalescing

policy/address/allocator.py::_FreeList:

  • Internal representation: list[tuple[int, int]] = [(start_offset, size), ...] — sorted by start offset.
  • alloc(nbytes):
    1. Iterate the free list from the front (first-fit).
    2. Take from the first block with size >= nbytes.
    3. Exact match → drop the block; otherwise shrink it to (start + nbytes, size - nbytes).
    4. _used += nbytes; return the taken start.
    5. If no block fits, AllocationError("overflow ... largest free block ...").
  • free(offset, nbytes):
    1. _used -= nbytes.
    2. bisect_left(self._free, (offset,)) finds the insertion index.
    3. If adjacent to the previous block (prev_start + prev_size == offset), merge.
    4. If adjacent to the next block (offset + nbytes == next_start), merge.
    5. Insert the coalesced range at the right sorted position.

This algorithm is weaker than best-fit / buddy on fragmentation, but the simulator's workload (mostly stack-like deploy/free) tolerates it. If the workload shape changes, D1 is a supersession candidate.

D2. Partial-overlap free is not validated

_FreeList.free(offset, nbytes) trusts the caller to pass the exact (offset, nbytes). It does not verify:

  • That the range was actually allocated.
  • That the range does not overlap another allocated region.

Reason: in a simulator context, callers always store the return value of alloc() and pass it back to free() — there is no external user input. Adding a safety check would cost O(N) per free and impact simulation wall-clock.

If this trust model breaks (e.g., a code path lets two tensors point at the same PA), this ADR must be revisited.

D3. PEMemAllocator — two channels for HBM/TCM

PEMemAllocator(sip_id, die_id, pe_id, cfg) holds two _FreeLists:

  • _hbm: size cfg.hbm_slice_bytes.
  • _tcm: size cfg.tcm_allocatable_bytes (= tcm_bytes_per_pe - tcm_scheduler_reserved_bytes).

alloc_hbm(nbytes) -> PhysAddr:

  • _hbm.alloc(nbytes) → offset.
  • PhysAddr.pe_hbm_addr(sip_id, die_id, pe_id, pe_local_hbm_offset=offset, slice_size_bytes=cfg.hbm_slice_bytes).
  • Failure raises AllocationError("HBM overflow ...").

free_hbm(pa, nbytes):

  • Recover PE-local offset via pa.hbm_offset - pe_id * cfg.hbm_slice_bytes.
  • _hbm.free(offset, nbytes).

alloc_tcm(nbytes) -> PhysAddr: similar; uses PhysAddr.pe_tcm_addr.

free_tcm(pa, nbytes): uses pa.sub_offset directly (TCM's PE-local offset equals its sub_offset).

The allocator does not see the scheduler-reserved TCM region (cfg.tcm_scheduler_reserved_bytes) — it is pre-subtracted from the _tcm capacity. This is consistent with ADR-0014's PE_SCHEDULER internal-buffer reservation.

D4. VirtualAllocator — page-aligned first-fit + coalescing

policy/address/va_allocator.py::VirtualAllocator:

  • Internal representation: same sorted list[tuple[int, int]] as _FreeList. Initially [(va_base, va_size)].
  • _align_up(nbytes) = ceil(nbytes / page_size) * page_size.
  • alloc(nbytes) -> int:
    1. aligned = _align_up(nbytes).
    2. First-fit a block with size >= aligned.
    3. Take aligned from the block's front; remove if exact.
    4. _used += aligned. Return the block's start (which is page- aligned).
    5. Failure → VaAllocationError.
  • free(va, nbytes): free _align_up(nbytes) worth. Coalesces with the same algorithm as _FreeList.

page_size has different defaults in two places:

  • VirtualAllocator.__init__'s parameter default: 2 MiB. Direct-call tests receive this.
  • RuntimeContext._ensure_allocators when constructing the instance: pe_mmu.attrs.get("page_size", 4096) — uses topology.yaml's pe_mmu.attrs.page_size if set, else falls back to 4 KiB.

The two defaults differ on purpose: VirtualAllocator's standalone default (2 MiB) aligns with ADR-0039's PE_MMU stopgap default for direct-test ergonomics; the context fallback (4 KiB) is the safe minimum when topology.yaml doesn't specify a page size. The production path is always the latter (via _ensure_allocators), and when topology.yaml sets page_size, that value flows consistently into both the MMU and the VA allocator.

If consistency breaks (e.g., VirtualAllocator instantiated with a page_size different from PE_MMU's), MMU map() falls into the sub-page region mode (ADR-0039 D3).

VA range defaults: va_base = 0x1_0000_0000 (= 4 GiB), va_size = 64 GiB. These are hardcoded in _ensure_allocators and have no semantic meaning in ADR-0011's VA model — they simply reserve enough device-wide space without colliding with host code.

D5. Lifecycle of allocator instances

  • RuntimeContext._ensure_allocators is lazy — called on the first _create_tensor.
  • The allocator dict (self._allocators) lives for the RuntimeContext's lifetime. A second deploy in the same process does not construct new objects.
  • RuntimeContext.cleanup() walks living tensors and calls _free_tensor(), which issues MMU unmaps + va_allocator.free + pemem_allocator.free_hbm — restoring the free lists. A subsequent RuntimeContext starts fresh.

This per-RuntimeContext isolation guarantees deterministic deploy → cleanup → deploy sequences within a single process.

D6. Allocator failure raises (no silent OOM)

Both _FreeList.alloc and VirtualAllocator.alloc raise AllocationError / VaAllocationError when no block fits. The message includes "required size + largest available block" to distinguish fragmentation from true OOM.

A silent fallback (e.g., allocating only as much as the largest free block) is strictly forbidden — a partially-allocated tensor reaching SimPy would cause routing / DMA to see incorrect PAs and silently corrupt simulation results.

D7. One allocator per address space

Physical address spaces are separated by PhysAddr sub-units (ADR-0001 D2.3); each sub-unit gets its own allocator instance:

  • HBM slice → PEMemAllocator._hbm.
  • PE TCM → PEMemAllocator._tcm.
  • (Currently unused) M_CPU local memory, CUBE SRAM → would need their own allocators. Today these are handled as IPCQ-only slots (ADR-0023 D9.7) and do not share PA space, so no free-list exists for them.

When a cube-level SRAM allocator is needed, _FreeList(cfg.sram_bytes_per_cube) is added per-cube (cfg.sram_bytes_per_cube is already defined in AddressConfig — the data model is ready).

Alternatives Considered

A1. Best-fit / buddy allocator

Rejected (currently). The workload's alloc/free pattern is stack-like (deploy order ≈ free order), so first-fit + coalescing controls fragmentation well enough. If long-running fragmentation appears in LLM kernel sweeps, a buddy-allocator ADR will replace D1.

A2. Add partial-overlap free validation

Rejected. D2's trust model plus the O(N) per-free cost makes this unattractive. A debug mode (e.g., KERNBENCH_DEBUG env var) that enables the check could be added later.

A3. A unified allocator for VA and PA

Rejected. VA space (64 GiB device-wide) and PA space (per-slice ~6 GiB) have different semantic dimensions — VA is the kernel's view, PA is the device sub-unit's view. ADR-0011's VA model (MMU maps between the two) calls for separated allocators.

A4. Multi-tier page sizes (large pages + small pages)

Rejected (currently). A single page size (2 MiB) matches LLM kernel tensor sizes (a few MiB to GiB); smaller mappings are absorbed by ADR-0039 D3's sub-page region mode. Multi-tier paging would require extending the MMU model itself — a separate ADR candidate.

Consequences

  • The allocator algorithm is pinned at ADR level (D1, D3, D4), so any future simulation scenario hitting fragmentation has a clear "we're using first-fit + coalescing" anchor to inspect.
  • D2's trust model is explicit, so any future code path that exposes alloc/free to direct user input will trigger this ADR's supersession early.
  • D7's one-allocator-per-sub-unit mapping is on record, so when M_CPU or SRAM need their own free-list, the addition point is obvious.
  • D4 captures the page_size dual-default and its production path (_ensure_allocators always wins), letting future topology.yaml page_size changes be assessed against ADR-0039's stopgap interaction quickly.