adr: add ADR-0046-0049 — close G4 coverage gaps from /report
Documents four cross-cutting surfaces that previously had no ADR backing, each surfaced as a G4 candidate by /report: - 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...), the two execution modes (command-list vs greenlet runner), scratch allocator semantics, dispatch-overhead model, and the kernel registry. - 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group (backend="ahbm") install path. world_size priority (algorithm > defaults > topology), the 4-step init sequence (load ccl.yaml, import algorithm module, derive world_size, install SFR + IPCQ), greenlet- local rank registry, all_reduce dispatch via _defer_wait, barrier no-op rationale, and the explicit list of unsupported dist.* APIs. - 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator free-list semantics. Offset-keyed first-fit with coalescing, the no-validation trust model for free(), HBM/TCM channel separation, page-aligned VA allocation, the page_size dual-default (VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and one-allocator-per-sub-unit rule. - 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog. H2D / D2H / PE DMA categories with their exact cube-index choices, the 32 KiB reference size, the 5-point utilization sweep, the formula vs actual column meanings, automatic invariant checks (monotonicity, D2H >= H2D, best < worst), per-case GraphEngine isolation, and the human-readable (not machine-parsable) output contract. Bilingual pair verifier passes for all four EN/KO pairs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,278 @@
|
||||
# ADR-0048: Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-22).
|
||||
|
||||
Pins down the free-list algorithm, page alignment, and coalescing rules
|
||||
used by `policy/address/allocator.py`'s `_FreeList` / `PEMemAllocator`
|
||||
and `va_allocator.py`'s `VirtualAllocator`. ADR-0001 (PhysAddr layout)
|
||||
and ADR-0011 (PA/VA/LA models) define the address schemes; the
|
||||
**allocation algorithms** had no ADR-level coverage.
|
||||
|
||||
## First action
|
||||
|
||||
### `_FreeList(capacity)`
|
||||
|
||||
On construction: `self._capacity = capacity`, `self._used = 0`,
|
||||
`self._free = [(0, capacity)]`. The first act is **establishing the
|
||||
entire region as one free block** — the tuple `(offset=0,
|
||||
size=capacity)` is the sole entry in the free list.
|
||||
|
||||
### `PEMemAllocator(sip_id, die_id, pe_id, cfg)`
|
||||
|
||||
On construction, builds two `_FreeList`s:
|
||||
|
||||
- `self._hbm = _FreeList(cfg.hbm_slice_bytes)` — the size of this PE's
|
||||
HBM slice (`hbm_bytes_per_cube // hbm_slices_per_cube`).
|
||||
- `self._tcm = _FreeList(cfg.tcm_allocatable_bytes)` — equals
|
||||
`tcm_bytes_per_pe - tcm_scheduler_reserved_bytes` (the scheduler
|
||||
reservation is pre-deducted).
|
||||
|
||||
So PEMemAllocator's first act is **constructing single-free-block
|
||||
HBM-slice and TCM regions for this PE**.
|
||||
|
||||
### `VirtualAllocator(va_base, va_size, page_size=2*1024*1024)`
|
||||
|
||||
On construction: `self._va_base = va_base`, `self._va_size = va_size`,
|
||||
`self._page_size = page_size`, `self._used = 0`, `self._free =
|
||||
[(va_base, va_size)]`. The first act is **establishing one block from
|
||||
va_base to va_size and stashing page_size**.
|
||||
|
||||
## Context
|
||||
|
||||
`runtime_api/context.py::_ensure_allocators` builds the allocator set
|
||||
in these stages:
|
||||
|
||||
1. Read `hbm_total_gb_per_cube`, `hbm_slices_per_cube`, `tcm_size_mb`,
|
||||
per-target_device SIP range, etc. from `spec`.
|
||||
2. Pack everything into a frozen `AddressConfig`.
|
||||
3. For every combination in the target SIP range × cubes × PEs,
|
||||
construct one `PEMemAllocator(sip, cube, pe, cfg)` instance.
|
||||
4. Construct one `VirtualAllocator(va_base=0x1_0000_0000, va_size=64
|
||||
GiB, page_size=pe_mmu.page_size)`.
|
||||
|
||||
Allocator responsibilities:
|
||||
|
||||
- **PEMemAllocator**: PA-space allocation in the PE-local HBM slice /
|
||||
TCM (including PhysAddr encoding).
|
||||
- **VirtualAllocator**: device-wide VA allocation, page-aligned.
|
||||
`RuntimeContext._create_tensor` then pushes VA → PA mappings to
|
||||
components via `MmuMapMsg`.
|
||||
|
||||
These algorithms are:
|
||||
|
||||
- **First-fit**, kept simple.
|
||||
- The free-block list is **sorted by start offset**.
|
||||
- On `free()`, **adjacent blocks coalesce**.
|
||||
|
||||
The rationale was not documented anywhere, so when someone asks "why
|
||||
not best-fit?", "why not a buddy allocator?", "why does partial-overlap
|
||||
free pass silently?", there was no anchor to answer from. This ADR
|
||||
provides it.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. `_FreeList` — offset-keyed first-fit + coalescing
|
||||
|
||||
`policy/address/allocator.py::_FreeList`:
|
||||
|
||||
- Internal representation: `list[tuple[int, int]] = [(start_offset,
|
||||
size), ...]` — sorted by start offset.
|
||||
- `alloc(nbytes)`:
|
||||
1. Iterate the free list from the front (first-fit).
|
||||
2. Take from the first block with `size >= nbytes`.
|
||||
3. Exact match → drop the block; otherwise shrink it to `(start +
|
||||
nbytes, size - nbytes)`.
|
||||
4. `_used += nbytes`; return the taken `start`.
|
||||
5. If no block fits, `AllocationError("overflow ... largest free
|
||||
block ...")`.
|
||||
- `free(offset, nbytes)`:
|
||||
1. `_used -= nbytes`.
|
||||
2. `bisect_left(self._free, (offset,))` finds the insertion index.
|
||||
3. If adjacent to the previous block (`prev_start + prev_size ==
|
||||
offset`), merge.
|
||||
4. If adjacent to the next block (`offset + nbytes == next_start`),
|
||||
merge.
|
||||
5. Insert the coalesced range at the right sorted position.
|
||||
|
||||
This algorithm is weaker than best-fit / buddy on fragmentation, but
|
||||
the simulator's workload (mostly stack-like deploy/free) tolerates it.
|
||||
If the workload shape changes, D1 is a supersession candidate.
|
||||
|
||||
### D2. Partial-overlap free is **not** validated
|
||||
|
||||
`_FreeList.free(offset, nbytes)` trusts the caller to pass the exact
|
||||
`(offset, nbytes)`. It does **not** verify:
|
||||
|
||||
- That the range was actually allocated.
|
||||
- That the range does not overlap another allocated region.
|
||||
|
||||
Reason: in a simulator context, callers always store the return value
|
||||
of `alloc()` and pass it back to `free()` — there is no external user
|
||||
input. Adding a safety check would cost O(N) per free and impact
|
||||
simulation wall-clock.
|
||||
|
||||
If this trust model breaks (e.g., a code path lets two tensors point
|
||||
at the same PA), this ADR must be revisited.
|
||||
|
||||
### D3. `PEMemAllocator` — two channels for HBM/TCM
|
||||
|
||||
`PEMemAllocator(sip_id, die_id, pe_id, cfg)` holds two `_FreeList`s:
|
||||
|
||||
- `_hbm`: size `cfg.hbm_slice_bytes`.
|
||||
- `_tcm`: size `cfg.tcm_allocatable_bytes` (= `tcm_bytes_per_pe -
|
||||
tcm_scheduler_reserved_bytes`).
|
||||
|
||||
`alloc_hbm(nbytes) -> PhysAddr`:
|
||||
|
||||
- `_hbm.alloc(nbytes)` → offset.
|
||||
- `PhysAddr.pe_hbm_addr(sip_id, die_id, pe_id,
|
||||
pe_local_hbm_offset=offset, slice_size_bytes=cfg.hbm_slice_bytes)`.
|
||||
- Failure raises `AllocationError("HBM overflow ...")`.
|
||||
|
||||
`free_hbm(pa, nbytes)`:
|
||||
|
||||
- Recover PE-local offset via `pa.hbm_offset - pe_id *
|
||||
cfg.hbm_slice_bytes`.
|
||||
- `_hbm.free(offset, nbytes)`.
|
||||
|
||||
`alloc_tcm(nbytes) -> PhysAddr`: similar; uses `PhysAddr.pe_tcm_addr`.
|
||||
|
||||
`free_tcm(pa, nbytes)`: uses `pa.sub_offset` directly (TCM's PE-local
|
||||
offset equals its sub_offset).
|
||||
|
||||
The allocator does not see the scheduler-reserved TCM region
|
||||
(`cfg.tcm_scheduler_reserved_bytes`) — it is pre-subtracted from the
|
||||
`_tcm` capacity. This is consistent with ADR-0014's PE_SCHEDULER
|
||||
internal-buffer reservation.
|
||||
|
||||
### D4. `VirtualAllocator` — page-aligned first-fit + coalescing
|
||||
|
||||
`policy/address/va_allocator.py::VirtualAllocator`:
|
||||
|
||||
- Internal representation: same sorted `list[tuple[int, int]]` as
|
||||
`_FreeList`. Initially `[(va_base, va_size)]`.
|
||||
- `_align_up(nbytes) = ceil(nbytes / page_size) * page_size`.
|
||||
- `alloc(nbytes) -> int`:
|
||||
1. `aligned = _align_up(nbytes)`.
|
||||
2. First-fit a block with `size >= aligned`.
|
||||
3. Take `aligned` from the block's front; remove if exact.
|
||||
4. `_used += aligned`. Return the block's `start` (which is page-
|
||||
aligned).
|
||||
5. Failure → `VaAllocationError`.
|
||||
- `free(va, nbytes)`: free `_align_up(nbytes)` worth. Coalesces with
|
||||
the same algorithm as `_FreeList`.
|
||||
|
||||
`page_size` has different defaults in two places:
|
||||
|
||||
- `VirtualAllocator.__init__`'s parameter default: `2 MiB`. Direct-call
|
||||
tests receive this.
|
||||
- `RuntimeContext._ensure_allocators` when constructing the instance:
|
||||
`pe_mmu.attrs.get("page_size", 4096)` — uses
|
||||
`topology.yaml`'s `pe_mmu.attrs.page_size` if set, else falls back
|
||||
to `4 KiB`.
|
||||
|
||||
The two defaults differ on purpose: `VirtualAllocator`'s standalone
|
||||
default (`2 MiB`) aligns with ADR-0039's PE_MMU stopgap default for
|
||||
direct-test ergonomics; the context fallback (`4 KiB`) is the safe
|
||||
minimum when `topology.yaml` doesn't specify a page size. The
|
||||
production path is always the latter (via `_ensure_allocators`), and
|
||||
when `topology.yaml` sets `page_size`, that value flows consistently
|
||||
into both the MMU and the VA allocator.
|
||||
|
||||
If consistency breaks (e.g., VirtualAllocator instantiated with a
|
||||
page_size different from PE_MMU's), MMU `map()` falls into the
|
||||
sub-page region mode (ADR-0039 D3).
|
||||
|
||||
VA range defaults: `va_base = 0x1_0000_0000` (= 4 GiB), `va_size = 64
|
||||
GiB`. These are hardcoded in `_ensure_allocators` and have no
|
||||
semantic meaning in ADR-0011's VA model — they simply reserve enough
|
||||
device-wide space without colliding with host code.
|
||||
|
||||
### D5. Lifecycle of allocator instances
|
||||
|
||||
- `RuntimeContext._ensure_allocators` is lazy — called on the first
|
||||
`_create_tensor`.
|
||||
- The allocator dict (`self._allocators`) lives for the
|
||||
RuntimeContext's lifetime. A second deploy in the same process
|
||||
does not construct new objects.
|
||||
- `RuntimeContext.cleanup()` walks living tensors and calls
|
||||
`_free_tensor()`, which issues MMU unmaps + `va_allocator.free` +
|
||||
`pemem_allocator.free_hbm` — restoring the free lists. A subsequent
|
||||
RuntimeContext starts fresh.
|
||||
|
||||
This per-RuntimeContext isolation guarantees deterministic deploy →
|
||||
cleanup → deploy sequences within a single process.
|
||||
|
||||
### D6. Allocator failure raises (no silent OOM)
|
||||
|
||||
Both `_FreeList.alloc` and `VirtualAllocator.alloc` raise
|
||||
`AllocationError` / `VaAllocationError` when no block fits. The message
|
||||
includes "required size + largest available block" to distinguish
|
||||
fragmentation from true OOM.
|
||||
|
||||
A silent fallback (e.g., allocating only as much as the largest free
|
||||
block) is strictly forbidden — a partially-allocated tensor reaching
|
||||
SimPy would cause routing / DMA to see incorrect PAs and silently
|
||||
corrupt simulation results.
|
||||
|
||||
### D7. One allocator per address space
|
||||
|
||||
Physical address spaces are separated by PhysAddr sub-units (ADR-0001
|
||||
D2.3); each sub-unit gets its own allocator instance:
|
||||
|
||||
- HBM slice → `PEMemAllocator._hbm`.
|
||||
- PE TCM → `PEMemAllocator._tcm`.
|
||||
- (Currently unused) M_CPU local memory, CUBE SRAM → would need their
|
||||
own allocators. Today these are handled as IPCQ-only slots (ADR-0023
|
||||
D9.7) and do not share PA space, so no free-list exists for them.
|
||||
|
||||
When a cube-level SRAM allocator is needed,
|
||||
`_FreeList(cfg.sram_bytes_per_cube)` is added per-cube
|
||||
(`cfg.sram_bytes_per_cube` is already defined in `AddressConfig` —
|
||||
the data model is ready).
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Best-fit / buddy allocator
|
||||
|
||||
Rejected (currently). The workload's alloc/free pattern is stack-like
|
||||
(deploy order ≈ free order), so first-fit + coalescing controls
|
||||
fragmentation well enough. If long-running fragmentation appears in LLM
|
||||
kernel sweeps, a buddy-allocator ADR will replace D1.
|
||||
|
||||
### A2. Add partial-overlap free validation
|
||||
|
||||
Rejected. D2's trust model plus the O(N) per-free cost makes this
|
||||
unattractive. A debug mode (e.g., `KERNBENCH_DEBUG` env var) that
|
||||
enables the check could be added later.
|
||||
|
||||
### A3. A unified allocator for VA and PA
|
||||
|
||||
Rejected. VA space (64 GiB device-wide) and PA space (per-slice ~6
|
||||
GiB) have different semantic dimensions — VA is the kernel's view, PA
|
||||
is the device sub-unit's view. ADR-0011's VA model (MMU maps between
|
||||
the two) calls for separated allocators.
|
||||
|
||||
### A4. Multi-tier page sizes (large pages + small pages)
|
||||
|
||||
Rejected (currently). A single page size (2 MiB) matches LLM kernel
|
||||
tensor sizes (a few MiB to GiB); smaller mappings are absorbed by
|
||||
ADR-0039 D3's sub-page region mode. Multi-tier paging would require
|
||||
extending the MMU model itself — a separate ADR candidate.
|
||||
|
||||
## Consequences
|
||||
|
||||
- The allocator algorithm is pinned at ADR level (D1, D3, D4), so any
|
||||
future simulation scenario hitting fragmentation has a clear "we're
|
||||
using first-fit + coalescing" anchor to inspect.
|
||||
- D2's trust model is explicit, so any future code path that exposes
|
||||
alloc/free to direct user input will trigger this ADR's supersession
|
||||
early.
|
||||
- D7's one-allocator-per-sub-unit mapping is on record, so when M_CPU
|
||||
or SRAM need their own free-list, the addition point is obvious.
|
||||
- D4 captures the page_size dual-default and its production path
|
||||
(`_ensure_allocators` always wins), letting future `topology.yaml`
|
||||
`page_size` changes be assessed against ADR-0039's stopgap
|
||||
interaction quickly.
|
||||
Reference in New Issue
Block a user