Files
kernbench2/docs/adr/ADR-0048-mem-allocator-algorithms.md
T
ywkang 9a02955770 adr: add ADR-0046-0049 — close G4 coverage gaps from /report
Documents four cross-cutting surfaces that previously had no ADR backing,
each surfaced as a G4 candidate by /report:

- 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates
  all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...),
  the two execution modes (command-list vs greenlet runner), scratch
  allocator semantics, dispatch-overhead model, and the kernel registry.

- 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group
  (backend="ahbm") install path. world_size priority (algorithm >
  defaults > topology), the 4-step init sequence (load ccl.yaml, import
  algorithm module, derive world_size, install SFR + IPCQ), greenlet-
  local rank registry, all_reduce dispatch via _defer_wait, barrier
  no-op rationale, and the explicit list of unsupported dist.* APIs.

- 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator
  free-list semantics. Offset-keyed first-fit with coalescing, the
  no-validation trust model for free(), HBM/TCM channel separation,
  page-aligned VA allocation, the page_size dual-default
  (VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and
  one-allocator-per-sub-unit rule.

- 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog.
  H2D / D2H / PE DMA categories with their exact cube-index choices,
  the 32 KiB reference size, the 5-point utilization sweep, the
  formula vs actual column meanings, automatic invariant checks
  (monotonicity, D2H >= H2D, best < worst), per-case GraphEngine
  isolation, and the human-readable (not machine-parsable) output
  contract.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:25:04 -07:00

279 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ADR-0048: Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
## Status
Accepted (2026-05-22).
Pins down the free-list algorithm, page alignment, and coalescing rules
used by `policy/address/allocator.py`'s `_FreeList` / `PEMemAllocator`
and `va_allocator.py`'s `VirtualAllocator`. ADR-0001 (PhysAddr layout)
and ADR-0011 (PA/VA/LA models) define the address schemes; the
**allocation algorithms** had no ADR-level coverage.
## First action
### `_FreeList(capacity)`
On construction: `self._capacity = capacity`, `self._used = 0`,
`self._free = [(0, capacity)]`. The first act is **establishing the
entire region as one free block** — the tuple `(offset=0,
size=capacity)` is the sole entry in the free list.
### `PEMemAllocator(sip_id, die_id, pe_id, cfg)`
On construction, builds two `_FreeList`s:
- `self._hbm = _FreeList(cfg.hbm_slice_bytes)` — the size of this PE's
HBM slice (`hbm_bytes_per_cube // hbm_slices_per_cube`).
- `self._tcm = _FreeList(cfg.tcm_allocatable_bytes)` — equals
`tcm_bytes_per_pe - tcm_scheduler_reserved_bytes` (the scheduler
reservation is pre-deducted).
So PEMemAllocator's first act is **constructing single-free-block
HBM-slice and TCM regions for this PE**.
### `VirtualAllocator(va_base, va_size, page_size=2*1024*1024)`
On construction: `self._va_base = va_base`, `self._va_size = va_size`,
`self._page_size = page_size`, `self._used = 0`, `self._free =
[(va_base, va_size)]`. The first act is **establishing one block from
va_base to va_size and stashing page_size**.
## Context
`runtime_api/context.py::_ensure_allocators` builds the allocator set
in these stages:
1. Read `hbm_total_gb_per_cube`, `hbm_slices_per_cube`, `tcm_size_mb`,
per-target_device SIP range, etc. from `spec`.
2. Pack everything into a frozen `AddressConfig`.
3. For every combination in the target SIP range × cubes × PEs,
construct one `PEMemAllocator(sip, cube, pe, cfg)` instance.
4. Construct one `VirtualAllocator(va_base=0x1_0000_0000, va_size=64
GiB, page_size=pe_mmu.page_size)`.
Allocator responsibilities:
- **PEMemAllocator**: PA-space allocation in the PE-local HBM slice /
TCM (including PhysAddr encoding).
- **VirtualAllocator**: device-wide VA allocation, page-aligned.
`RuntimeContext._create_tensor` then pushes VA → PA mappings to
components via `MmuMapMsg`.
These algorithms are:
- **First-fit**, kept simple.
- The free-block list is **sorted by start offset**.
- On `free()`, **adjacent blocks coalesce**.
The rationale was not documented anywhere, so when someone asks "why
not best-fit?", "why not a buddy allocator?", "why does partial-overlap
free pass silently?", there was no anchor to answer from. This ADR
provides it.
## Decision
### D1. `_FreeList` — offset-keyed first-fit + coalescing
`policy/address/allocator.py::_FreeList`:
- Internal representation: `list[tuple[int, int]] = [(start_offset,
size), ...]` — sorted by start offset.
- `alloc(nbytes)`:
1. Iterate the free list from the front (first-fit).
2. Take from the first block with `size >= nbytes`.
3. Exact match → drop the block; otherwise shrink it to `(start +
nbytes, size - nbytes)`.
4. `_used += nbytes`; return the taken `start`.
5. If no block fits, `AllocationError("overflow ... largest free
block ...")`.
- `free(offset, nbytes)`:
1. `_used -= nbytes`.
2. `bisect_left(self._free, (offset,))` finds the insertion index.
3. If adjacent to the previous block (`prev_start + prev_size ==
offset`), merge.
4. If adjacent to the next block (`offset + nbytes == next_start`),
merge.
5. Insert the coalesced range at the right sorted position.
This algorithm is weaker than best-fit / buddy on fragmentation, but
the simulator's workload (mostly stack-like deploy/free) tolerates it.
If the workload shape changes, D1 is a supersession candidate.
### D2. Partial-overlap free is **not** validated
`_FreeList.free(offset, nbytes)` trusts the caller to pass the exact
`(offset, nbytes)`. It does **not** verify:
- That the range was actually allocated.
- That the range does not overlap another allocated region.
Reason: in a simulator context, callers always store the return value
of `alloc()` and pass it back to `free()` — there is no external user
input. Adding a safety check would cost O(N) per free and impact
simulation wall-clock.
If this trust model breaks (e.g., a code path lets two tensors point
at the same PA), this ADR must be revisited.
### D3. `PEMemAllocator` — two channels for HBM/TCM
`PEMemAllocator(sip_id, die_id, pe_id, cfg)` holds two `_FreeList`s:
- `_hbm`: size `cfg.hbm_slice_bytes`.
- `_tcm`: size `cfg.tcm_allocatable_bytes` (= `tcm_bytes_per_pe -
tcm_scheduler_reserved_bytes`).
`alloc_hbm(nbytes) -> PhysAddr`:
- `_hbm.alloc(nbytes)` → offset.
- `PhysAddr.pe_hbm_addr(sip_id, die_id, pe_id,
pe_local_hbm_offset=offset, slice_size_bytes=cfg.hbm_slice_bytes)`.
- Failure raises `AllocationError("HBM overflow ...")`.
`free_hbm(pa, nbytes)`:
- Recover PE-local offset via `pa.hbm_offset - pe_id *
cfg.hbm_slice_bytes`.
- `_hbm.free(offset, nbytes)`.
`alloc_tcm(nbytes) -> PhysAddr`: similar; uses `PhysAddr.pe_tcm_addr`.
`free_tcm(pa, nbytes)`: uses `pa.sub_offset` directly (TCM's PE-local
offset equals its sub_offset).
The allocator does not see the scheduler-reserved TCM region
(`cfg.tcm_scheduler_reserved_bytes`) — it is pre-subtracted from the
`_tcm` capacity. This is consistent with ADR-0014's PE_SCHEDULER
internal-buffer reservation.
### D4. `VirtualAllocator` — page-aligned first-fit + coalescing
`policy/address/va_allocator.py::VirtualAllocator`:
- Internal representation: same sorted `list[tuple[int, int]]` as
`_FreeList`. Initially `[(va_base, va_size)]`.
- `_align_up(nbytes) = ceil(nbytes / page_size) * page_size`.
- `alloc(nbytes) -> int`:
1. `aligned = _align_up(nbytes)`.
2. First-fit a block with `size >= aligned`.
3. Take `aligned` from the block's front; remove if exact.
4. `_used += aligned`. Return the block's `start` (which is page-
aligned).
5. Failure → `VaAllocationError`.
- `free(va, nbytes)`: free `_align_up(nbytes)` worth. Coalesces with
the same algorithm as `_FreeList`.
`page_size` has different defaults in two places:
- `VirtualAllocator.__init__`'s parameter default: `2 MiB`. Direct-call
tests receive this.
- `RuntimeContext._ensure_allocators` when constructing the instance:
`pe_mmu.attrs.get("page_size", 4096)` — uses
`topology.yaml`'s `pe_mmu.attrs.page_size` if set, else falls back
to `4 KiB`.
The two defaults differ on purpose: `VirtualAllocator`'s standalone
default (`2 MiB`) aligns with ADR-0039's PE_MMU stopgap default for
direct-test ergonomics; the context fallback (`4 KiB`) is the safe
minimum when `topology.yaml` doesn't specify a page size. The
production path is always the latter (via `_ensure_allocators`), and
when `topology.yaml` sets `page_size`, that value flows consistently
into both the MMU and the VA allocator.
If consistency breaks (e.g., VirtualAllocator instantiated with a
page_size different from PE_MMU's), MMU `map()` falls into the
sub-page region mode (ADR-0039 D3).
VA range defaults: `va_base = 0x1_0000_0000` (= 4 GiB), `va_size = 64
GiB`. These are hardcoded in `_ensure_allocators` and have no
semantic meaning in ADR-0011's VA model — they simply reserve enough
device-wide space without colliding with host code.
### D5. Lifecycle of allocator instances
- `RuntimeContext._ensure_allocators` is lazy — called on the first
`_create_tensor`.
- The allocator dict (`self._allocators`) lives for the
RuntimeContext's lifetime. A second deploy in the same process
does not construct new objects.
- `RuntimeContext.cleanup()` walks living tensors and calls
`_free_tensor()`, which issues MMU unmaps + `va_allocator.free` +
`pemem_allocator.free_hbm` — restoring the free lists. A subsequent
RuntimeContext starts fresh.
This per-RuntimeContext isolation guarantees deterministic deploy →
cleanup → deploy sequences within a single process.
### D6. Allocator failure raises (no silent OOM)
Both `_FreeList.alloc` and `VirtualAllocator.alloc` raise
`AllocationError` / `VaAllocationError` when no block fits. The message
includes "required size + largest available block" to distinguish
fragmentation from true OOM.
A silent fallback (e.g., allocating only as much as the largest free
block) is strictly forbidden — a partially-allocated tensor reaching
SimPy would cause routing / DMA to see incorrect PAs and silently
corrupt simulation results.
### D7. One allocator per address space
Physical address spaces are separated by PhysAddr sub-units (ADR-0001
D2.3); each sub-unit gets its own allocator instance:
- HBM slice → `PEMemAllocator._hbm`.
- PE TCM → `PEMemAllocator._tcm`.
- (Currently unused) M_CPU local memory, CUBE SRAM → would need their
own allocators. Today these are handled as IPCQ-only slots (ADR-0023
D9.7) and do not share PA space, so no free-list exists for them.
When a cube-level SRAM allocator is needed,
`_FreeList(cfg.sram_bytes_per_cube)` is added per-cube
(`cfg.sram_bytes_per_cube` is already defined in `AddressConfig` —
the data model is ready).
## Alternatives Considered
### A1. Best-fit / buddy allocator
Rejected (currently). The workload's alloc/free pattern is stack-like
(deploy order ≈ free order), so first-fit + coalescing controls
fragmentation well enough. If long-running fragmentation appears in LLM
kernel sweeps, a buddy-allocator ADR will replace D1.
### A2. Add partial-overlap free validation
Rejected. D2's trust model plus the O(N) per-free cost makes this
unattractive. A debug mode (e.g., `KERNBENCH_DEBUG` env var) that
enables the check could be added later.
### A3. A unified allocator for VA and PA
Rejected. VA space (64 GiB device-wide) and PA space (per-slice ~6
GiB) have different semantic dimensions — VA is the kernel's view, PA
is the device sub-unit's view. ADR-0011's VA model (MMU maps between
the two) calls for separated allocators.
### A4. Multi-tier page sizes (large pages + small pages)
Rejected (currently). A single page size (2 MiB) matches LLM kernel
tensor sizes (a few MiB to GiB); smaller mappings are absorbed by
ADR-0039 D3's sub-page region mode. Multi-tier paging would require
extending the MMU model itself — a separate ADR candidate.
## Consequences
- The allocator algorithm is pinned at ADR level (D1, D3, D4), so any
future simulation scenario hitting fragmentation has a clear "we're
using first-fit + coalescing" anchor to inspect.
- D2's trust model is explicit, so any future code path that exposes
alloc/free to direct user input will trigger this ADR's supersession
early.
- D7's one-allocator-per-sub-unit mapping is on record, so when M_CPU
or SRAM need their own free-list, the addition point is obvious.
- D4 captures the page_size dual-default and its production path
(`_ensure_allocators` always wins), letting future `topology.yaml`
`page_size` changes be assessed against ADR-0039's stopgap
interaction quickly.