ADR-0046: TLContext — Kernel-side tl.* API Contract
Status
Accepted (2026-05-22).
Documents the set of tl.* primitives exposed by the TLContext in
src/kernbench/triton_emu/, their semantics, and the two
execution-mode contracts (command-list / greenlet runner). ADR-0014 and
ADR-0020 define the PE pipeline and the 2-pass execution model, but the
tl.* surface that bench kernel functions call had no ADR-level coverage.
First action
When TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id, num_cubes, scratch_base, scratch_size) is instantiated, the first action
is to initialize six categories of state:
- self._pe_id, self._num_programs, self._cube_id, self._num_cubes: values that tl.program_id / tl.num_programs will return.
- self._dispatch_cycles: cycle count emitted as PeCpuOverheadCmd(cycles) at the start of every tl.* API call.
- self._runner: KernelRunner instance (present → greenlet mode; absent → command-list mode).
- self._commands: list[PeCommand] = [], the command-list accumulator (command-list mode only).
- self._handle_counter = 0, self._completion_counter = 0: counters for generating TensorHandle / CompletionHandle ids.
- self._scratch_base, self._scratch_size, self._scratch_cursor = 0: PE-local scratch region (used for math/dot/composite output handle addresses).
In short, TLContext's first act is "record where (sip/cube/pe) and at what scale (num_programs/num_cubes) this kernel instance runs, and pick its dispatch mode (runner present or not)". No SimPy event is created and no command is emitted at this moment.
The runtime first action happens when the kernel function first calls a
tl.<api>(). The standard entry for every tl.* API is:
- Call self._emit_dispatch_overhead(): if dispatch_cycles > 0, immediately _emit a PeCpuOverheadCmd(dispatch_cycles).
- Per-API processing (TensorHandle creation, command construction).
- self._emit(cmd): in runner mode this greenlet.switch()es the cmd to SimPy; in command-list mode it appends to self._commands.
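The three entry steps can be sketched as below. This is a minimal model for illustration, not the real TLContext: SketchContext, FakeRunner, load_like, and the command field names are hypothetical, and FakeRunner merely records commands where the real KernelRunner would switch greenlets. Only _emit, _emit_dispatch_overhead, PeCpuOverheadCmd, and switch_to_simpy are names taken from this ADR.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class PeCpuOverheadCmd:
    cycles: int


@dataclass
class DmaReadCmd:
    # Field names are assumptions made for this sketch.
    ptr: int
    nbytes: int


class FakeRunner:
    """Hypothetical stand-in for KernelRunner: records instead of switching greenlets."""

    def __init__(self) -> None:
        self.seen: List[Any] = []

    def switch_to_simpy(self, cmd: Any) -> Any:
        self.seen.append(cmd)
        return None  # the real runner would hand back the SimPy-side result


class SketchContext:
    def __init__(self, dispatch_cycles: int = 1,
                 runner: Optional[FakeRunner] = None) -> None:
        self._dispatch_cycles = dispatch_cycles
        self._runner = runner
        self._commands: List[Any] = []

    def _emit(self, cmd: Any) -> Any:
        # Runner present: hand the command to the SimPy side.
        if self._runner is not None:
            return self._runner.switch_to_simpy(cmd)
        # Runner absent: accumulate for later replay.
        self._commands.append(cmd)
        return None

    def _emit_dispatch_overhead(self) -> None:
        if self._dispatch_cycles > 0:
            self._emit(PeCpuOverheadCmd(self._dispatch_cycles))

    def load_like(self, ptr: int, nbytes: int) -> Any:
        self._emit_dispatch_overhead()   # step 1: dispatch overhead
        cmd = DmaReadCmd(ptr, nbytes)    # step 2: per-API processing
        return self._emit(cmd)           # step 3: mode-blind emit
```

In command-list mode the overhead command and the DMA command both land in _commands; with a runner attached and dispatch_cycles set to 0, only the DMA command reaches the runner.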
Context
The tl.* surface consists of TLContext's methods, and the tl
parameter received by a kernel function is one of these objects. The
contract the user (bench author) sees:
- Which primitives exist.
- What data flow each primitive triggers (DMA / compute / IPCQ / metadata-only).
- How a TensorHandle's space and addr are decided.
- The difference between command-list and greenlet modes.
ADR-0014 (PE pipeline) defines the PeCommands consumed by PE_SCHEDULER,
but how tl.* emits them is a code-only convention. ADR-0020 (2-pass
data execution) mentions greenlet mode in D3 but does not pin down the
signature difference (return-value handling) between the runner /
non-runner paths. This ADR fills the gap.
Decision
D1. The tl parameter is a TLContext instance
A bench kernel function has the signature:
def _kernel(arg1, arg2, ..., tl, **kwargs):
    ...
tl is a kernbench.triton_emu.tl_context.TLContext instance. The name
imitates real Triton's triton.language module; the actual Triton
module is not passed in.
The kernel is plain Python — no yield or async. tl.* calls produce
SimPy events, but to the caller they appear synchronous because in
greenlet mode the KernelRunner relays between SimPy and the kernel
(ADR-0020 D3).
D2. Two execution modes — command-list / greenlet runner
- Command-list mode (runner is None): tl.* calls append PeCommand to self._commands. DMA / GEMM / Math consume no SimPy time and return metadata-only TensorHandles (data=None). PE_SCHEDULER / sim_engine later replays the command sequence in time.
- Greenlet runner mode (runner is not None): tl.* calls self._emit(cmd) → runner.switch_to_simpy(cmd), handing control to the parent greenlet (SimPy). The parent distributes the cmd to components, consumes SimPy time, and (for DMA reads) returns real numpy data. The kernel receives the result and continues to the next line (the data-aware execution model from ADR-0020 D3).
The choice of mode is decided by whether a KernelRunner is injected into
the TLContext. The tl.* methods themselves are mode-blind — they go
through _emit() uniformly.
D3. Primitive categories
D3.1. Reference (no DMA, metadata only)
tl.ref(ptr, shape, dtype="f16") -> TensorHandle: create a handle referencing HBM data without issuing DMA. Used when the scheduler streams the data per-tile (e.g., the b operand of a composite GEMM).
D3.2. Data movement (blocking, DMA engine)
- tl.load(ptr, shape, dtype="f16") -> TensorHandle: HBM → handle. Emits DmaReadCmd. In greenlet mode the returned handle's .data carries real numpy data; in command-list mode it is a placeholder. The handle has space="hbm", pinned=True.
- tl.store(ptr, handle) -> None: TCM → HBM. Emits DmaWriteCmd. In greenlet mode, when handle.data is present, _store.write("hbm", ptr, data) runs first (visibility = issue time, ADR-0020 D3).
D3.3. GEMM / compute (blocking)
- tl.dot(a, b) -> TensorHandle: a @ b. Both operands must live in TCM; shapes (M,K) × (K,N) → (M,N). Emits GemmCmd; the output handle is allocated from PE-local scratch via _make_compute_out(shape, dtype).
- tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None, acc_dtype=None, tile_shape=None) -> CompletionHandle: non-blocking tiled pipeline. Emits CompositeCmd. epilogue is a list of dicts, each with "op" plus op-specific fields and an optional "scope" (k_tile / output_tile). Unknown ops or missing fields raise ValueError immediately. The returned CompletionHandle synchronizes via tl.wait(h).
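The tl.dot shape rule, (M,K) × (K,N) → (M,N), can be sketched as a standalone helper. dot_out_shape is a hypothetical name, and raising ValueError on a mismatch is an assumption of this sketch; the ADR does not say how the real tl.dot reports a bad shape.

```python
from typing import Tuple


def dot_out_shape(a_shape: Tuple[int, ...], b_shape: Tuple[int, ...]) -> Tuple[int, int]:
    """Return the (M, N) output shape for a (M, K) x (K, N) product."""
    if len(a_shape) != 2 or len(b_shape) != 2:
        raise ValueError("both operands must be 2-D")  # assumption: error type
    m, k = a_shape
    k2, n = b_shape
    if k != k2:
        raise ValueError(f"inner dimensions differ: {k} vs {k2}")
    return (m, n)
```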
D3.4. Math: unary (blocking)
tl.exp(x), tl.log(x), tl.sqrt(x), tl.abs(x), tl.sigmoid(x), tl.cos(x), tl.sin(x): each emits MathCmd(op=<name>, inputs=(x,), out=). The out handle is scratch-allocated with the same shape/dtype as x.
D3.5. Math: binary (blocking)
- tl.maximum(a, b), tl.minimum(a, b): _binary_math.
- tl.fma(a, b, c): a*b + c. Three inputs.
- tl.clamp(x, min, max): MathCmd(op="clamp", inputs=(x, min, max)).
- tl.where(cond, a, b): MathCmd(op="where", inputs=(cond, a, b)).
- tl.softmax(x, axis=-1): a single MathCmd(op="softmax"), so timing is charged as one dispatch. Phase 2 DataExecutor expands it to the canonical (x-max → exp → sum → div) sequence.
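The canonical softmax expansion (x-max → exp → sum → div) looks like this for a 1-D input. It is a pure-Python sketch of the arithmetic only; softmax_expanded is a hypothetical name and says nothing about how DataExecutor is actually implemented.

```python
import math
from typing import List


def softmax_expanded(x: List[float]) -> List[float]:
    x_max = max(x)                          # step 1: row maximum
    shifted = [v - x_max for v in x]        # step 2: x - max (numerical stability)
    exps = [math.exp(v) for v in shifted]   # step 3: exp
    total = sum(exps)                       # step 4: sum
    return [e / total for e in exps]        # step 5: div
```

Subtracting the maximum first keeps large inputs such as 1000.0 from overflowing in exp.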
D3.6. Reduction (blocking)
tl.sum(x, axis), tl.max(x, axis), tl.min(x, axis): return an output handle with the axis size collapsed to 1. Emit MathCmd(op=<name>, inputs=(x,), out=, axis=axis).
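"Axis size collapsed to 1" means the reduced dimension is kept with length 1, not dropped. A minimal sketch follows; reduced_shape is a hypothetical helper, and its handling of negative axes is an assumption.

```python
from typing import Tuple


def reduced_shape(shape: Tuple[int, ...], axis: int) -> Tuple[int, ...]:
    """Output shape of a reduction that keeps the reduced axis with size 1."""
    ndim = len(shape)
    if not -ndim <= axis < ndim:
        raise ValueError(f"axis {axis} out of range for {ndim}-D shape")
    axis %= ndim  # assumption: negative axes count from the end
    return tuple(1 if i == axis else d for i, d in enumerate(shape))
```

For example, reducing a (4, 8) input along axis 1 gives a (4, 1) handle, which corresponds to keepdims=True in NumPy terms.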
D3.7. Index / scalar (PE_CPU, no engine)
- tl.program_id(axis=0) -> int: axis==0 → pe_id (cube-local PE index), axis==1 → cube_id (ADR-0022).
- tl.num_programs(axis=0) -> int: axis==0 → num_programs (PEs per cube), axis==1 → num_cubes.
- tl.arange(start, end, dtype="i32") -> TensorHandle: an index range in TCM. No command emitted.
- tl.zeros(shape, dtype="f16") -> TensorHandle, tl.full(shape, value, dtype="f16") -> TensorHandle: TCM placeholder. No command emitted.
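The axis mapping of tl.program_id / tl.num_programs can be sketched as below. IndexSketch and global_pe_index are hypothetical names, and raising ValueError for other axes is an assumption. The flattened index is one possible way a kernel might combine the two axes; this ADR does not prescribe it.

```python
from typing import Any


class IndexSketch:
    """Hypothetical holder for the four values recorded at construction."""

    def __init__(self, pe_id: int, num_programs: int, cube_id: int, num_cubes: int) -> None:
        self._pe_id = pe_id
        self._num_programs = num_programs
        self._cube_id = cube_id
        self._num_cubes = num_cubes

    def program_id(self, axis: int = 0) -> int:
        if axis == 0:
            return self._pe_id    # cube-local PE index
        if axis == 1:
            return self._cube_id
        raise ValueError(f"unsupported axis {axis}")  # assumption

    def num_programs(self, axis: int = 0) -> int:
        if axis == 0:
            return self._num_programs  # PEs per cube
        if axis == 1:
            return self._num_cubes
        raise ValueError(f"unsupported axis {axis}")  # assumption


def global_pe_index(tl: Any) -> int:
    # Illustrative only: flatten (cube_id, pe_id) into one linear index.
    return tl.program_id(1) * tl.num_programs(0) + tl.program_id(0)
```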
D3.8. Scalar helpers (no command, no engine)
TLContext.cdiv(a, b) -> int (static): ceiling division, -(-a // b). Mirrors real Triton's tl.cdiv.
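The expression -(-a // b) relies on Python's floor division rounding toward negative infinity. A standalone sketch of the same formula, written as a free function for illustration:

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division using only integer arithmetic."""
    return -(-a // b)


# Typical use: number of tiles needed to cover n elements with a given tile size.
n, tile = 1000, 128
num_tiles = cdiv(n, tile)
print(num_tiles)
```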
D3.9. Metadata-only (no compute, no DMA)
tl.trans(x) -> TensorHandle: a new handle with the last two dims swapped. Shares addr and data; no command emitted.
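A metadata-only transpose can be sketched with a small dataclass. HandleSketch is a hypothetical stand-in for TensorHandle with only the fields this example needs, and the ValueError for inputs with fewer than two dims is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Any, Tuple


@dataclass(frozen=True)
class HandleSketch:
    addr: int
    shape: Tuple[int, ...]
    dtype: str = "f16"
    data: Any = None


def trans(x: HandleSketch) -> HandleSketch:
    if len(x.shape) < 2:
        raise ValueError("trans needs at least two dims")  # assumption
    new_shape = x.shape[:-2] + (x.shape[-1], x.shape[-2])
    # Only the shape changes; addr and data are carried over unchanged.
    return replace(x, shape=new_shape)
```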
D3.10. IPCQ (CCL) primitives (ADR-0023 D4)
- tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None, dtype="f16", space="tcm") -> None: blocking send. Accepts either handle form or raw-address form. Emits IpcqSendCmd. The handle's .data snapshot rides along on the command, avoiding the race where a later inbound IPCQ overwrites the slot before the outbound PE_DMA reads it.
- tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None, dst_space=None) -> TensorHandle: blocking recv. Providing both dst_addr and dst_space enters "copy_to_dst" mode; otherwise "return_slot" mode. In greenlet mode the handle's .data carries the real data.
- tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle: DIAGNOSTIC ONLY. Has the same blocking-arrival semantics as tl.recv but skips the slot-read latency charge (slot-IO + PE↔bank fabric drain). Used in the pe2pe overview plot for an apples-to-apples comparison against tl.store. Production kernels MUST NOT use it; the diagnostic flag is isolated in its own command branch (consume=False) so it cannot be accidentally enabled.
- tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture: non-blocking recv. Returns a RecvFuture; resolved later by tl.wait(future).
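The tl.recv mode selection reduces to one predicate. recv_mode is a hypothetical helper, and this sketch assumes that "providing" an argument means passing a value other than None, so an address of 0 still counts as provided.

```python
from typing import Optional


def recv_mode(dst_addr: Optional[int] = None, dst_space: Optional[str] = None) -> str:
    """Pick the recv mode from the two optional destination arguments."""
    if dst_addr is not None and dst_space is not None:
        return "copy_to_dst"
    return "return_slot"
```

Passing only one of the two arguments falls through to "return_slot" in this sketch; whether the real implementation rejects that combination is not stated here.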
D3.11. Composite + control
- tl.composite(...): see D3.3.
- tl.wait(handle=None): wait on a CompletionHandle (composite), a RecvFuture (async recv), or None (all pending composites).
- tl.cycles(n): declare a scalar PE_CPU overhead. Emits PeCpuOverheadCmd(cycles=n).
D4. TensorHandle arithmetic operators — thread-local TLContext
At module load, tl_context.py::_enable_tensor_ops() runs and patches
TensorHandle.__add__, __sub__, __mul__, __truediv__. Each
operator calls _binary_math on the active TLContext stored in a
module-level thread-local _ctx.
So inside a kernel, c = a + b is equivalent to emitting
MathCmd(op="add", inputs=(a, b), out=) and returning a new
TensorHandle.
Active-TLContext management:
- TLContext._set_active(ctx): set the active ctx for the current thread/greenlet.
- TLContext._get_active(): read it (RuntimeError if unset).
- run_kernel(kernel_fn, tl_ctx, *args, **kwargs): helper. Sets active on entry, runs the kernel, restores None on exit.
KernelRunner re-asserts _set_active(tl) inside its _switch_kernel
just before resuming the kernel, so a sibling PE runner that overwrote
the thread-local context is correctly recovered.
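The operator patching and active-context lookup can be sketched as follows. HandleSketch and ContextSketch are hypothetical stand-ins. The op names other than "add", and passing the context to the kernel as the tl keyword, are assumptions of this sketch. It uses threading.local and does not model the greenlet re-assertion described above.

```python
import threading
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

_ctx = threading.local()  # module-level holder for the active context


@dataclass
class HandleSketch:
    shape: Tuple[int, ...]
    dtype: str = "f16"


class ContextSketch:
    def __init__(self) -> None:
        self.ops: List[str] = []

    def _binary_math(self, op: str, a: HandleSketch, b: HandleSketch) -> HandleSketch:
        self.ops.append(op)  # the real code would emit a MathCmd here
        return HandleSketch(a.shape, a.dtype)

    @staticmethod
    def _set_active(ctx: Any) -> None:
        _ctx.active = ctx

    @staticmethod
    def _get_active() -> "ContextSketch":
        ctx = getattr(_ctx, "active", None)
        if ctx is None:
            raise RuntimeError("no active context")
        return ctx


def _enable_tensor_ops() -> None:
    def make(op: str) -> Callable[[HandleSketch, HandleSketch], HandleSketch]:
        def method(self: HandleSketch, other: HandleSketch) -> HandleSketch:
            return ContextSketch._get_active()._binary_math(op, self, other)
        return method

    # Op names other than "add" are assumptions.
    for dunder, op in (("__add__", "add"), ("__sub__", "sub"),
                       ("__mul__", "mul"), ("__truediv__", "div")):
        setattr(HandleSketch, dunder, make(op))


_enable_tensor_ops()  # runs once at module load


def run_kernel(kernel_fn: Callable[..., Any], tl_ctx: ContextSketch,
               *args: Any, **kwargs: Any) -> Any:
    ContextSketch._set_active(tl_ctx)
    try:
        return kernel_fn(*args, tl=tl_ctx, **kwargs)  # assumption: tl passed by keyword
    finally:
        ContextSketch._set_active(None)
```

Outside run_kernel, an expression such as a + b raises RuntimeError because no context is active.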
D5. Scratch allocator — compute output handles
Ops that produce a result — tl.dot, tl.exp, tl.add (via
TensorHandle __add__), etc. — call _make_compute_out(shape, dtype)
to obtain a 16-byte-aligned scratch address. The address is published
with space="tcm", so the handle can later be the source of a
tl.send / tl.store.
When _scratch_base == 0 (e.g., command-list mode), the address is 0
and the handle cannot be a send/store source (in that case, only
tl.load-returned handles are valid sources).
When the cursor exceeds _scratch_size (default 1 MiB), a
RuntimeError is raised. The cursor must reset between kernel
invocations (current code naturally satisfies this: KernelRunner
creates a fresh TLContext each time).
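A bump allocator with the behavior described above might look like this. ScratchSketch and alloc are hypothetical names. The sketch assumes the base address is itself 16-byte aligned and that alignment is applied to the cursor before each allocation; the real _make_compute_out may differ on both points.

```python
class ScratchSketch:
    ALIGN = 16

    def __init__(self, base: int, size: int = 1 << 20) -> None:
        self._scratch_base = base
        self._scratch_size = size  # default 1 MiB
        self._scratch_cursor = 0

    def alloc(self, nbytes: int) -> int:
        if self._scratch_base == 0:
            return 0  # no scratch region configured: address stays 0
        # Round the cursor up to the next 16-byte boundary.
        start = (self._scratch_cursor + self.ALIGN - 1) & ~(self.ALIGN - 1)
        end = start + nbytes
        if end > self._scratch_size:
            raise RuntimeError("scratch region exhausted")
        self._scratch_cursor = end
        return self._scratch_base + start
```

Because nothing is ever freed, the only way to reclaim space is to start from a fresh cursor, which is why a new context per kernel invocation matters.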
D6. Dispatch overhead — PeCpuOverheadCmd(dispatch_cycles)
Every non-metadata tl.* call starts with _emit_dispatch_overhead(),
which — when dispatch_cycles > 0 — emits
PeCpuOverheadCmd(dispatch_cycles). This models the cycles PE_CPU
spends dispatching the command.
Defaults:
- TLContext.__init__'s dispatch_cycles parameter default: 1 cycle.
- TLContext built by KernelRunner: 0 cycles (greenlet mode handles cycle accounting differently, in line with the intent of ADR-0020 D3).
D7. Kernel registry (triton_emu/registry.py)
A separate _kernels: dict[str, Callable] holds the name → function
mapping:
- register_kernel(name, fn): ValueError on duplicate.
- get_kernel(name): KeyError if missing.
- clear_registry(): test-only.
RuntimeContext.launch(kernel_name, kernel_fn, *args) overwrites
_kernels[kernel_name] = kernel_fn on every call (last-call-wins,
idempotent) — consistent with ADR-0045 D8's launch behavior.
PE_CPU looks up KernelRef.name in the registry and runs the function
through KernelRunner.
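The registry semantics can be sketched in a few lines. The three function names and _kernels come from this section. launch_register is a hypothetical helper that models only the registry side effect of RuntimeContext.launch.

```python
from typing import Callable, Dict

_kernels: Dict[str, Callable] = {}


def register_kernel(name: str, fn: Callable) -> None:
    if name in _kernels:
        raise ValueError(f"kernel {name!r} already registered")
    _kernels[name] = fn


def get_kernel(name: str) -> Callable:
    return _kernels[name]  # raises KeyError if the name is missing


def clear_registry() -> None:
    _kernels.clear()  # test-only


def launch_register(kernel_name: str, kernel_fn: Callable) -> None:
    # Hypothetical: unlike register_kernel, launch overwrites (last call wins).
    _kernels[kernel_name] = kernel_fn
```

The two write paths differ on purpose in this model: explicit registration rejects a name collision, while launch silently replaces the entry.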
Alternatives Considered
A1. Fold tl.* into ADR-0014 / ADR-0020
Rejected. ADR-0014 covers the PE pipeline (sim_engine-side consumption
of PeCommands); ADR-0020 covers 2-pass execution (Phase 1 timing /
Phase 2 data). The tl.* surface is what the kernel author touches; a
dedicated ADR improves findability and onboarding.
A2. Deprecate command-list mode
Rejected (currently). Simple unit tests and kernel verification benefit from the lighter command-list path — it exposes a PeCommand sequence inspector without requiring greenlet machinery. When greenlet-mode semantics (real data, Phase 2) are needed, D2 explicitly selects them.
A3. Remove TensorHandle arithmetic operators
Rejected. They mimic real Triton kernel ergonomics (e.g., c = a + b),
and the thread-local active-ctx pattern works cleanly. The explicit
function-form (tl.add(a, b)) is also exposed in D3.5, so the
operators are syntactic sugar.
A4. Expand softmax into the explicit sequence (max → exp → sum → div)
Partially adopted. tl.softmax is a single MathCmd(op="softmax") for
timing accounting (D3.5), but Phase 2 DataExecutor expands it to the
canonical sequence for real-data computation. The timing model stays
atomic while the data model is expanded; the split is intentional.
Consequences
- Every tl.* primitive a bench author meets is classified and defined in a single ADR. Paired with ADR-0045 D8's host-side surface (torch.empty etc.), the inside-kernel and outside-kernel authoring guides are now complete.
- The command-list / greenlet difference is pinned in D2, so any new tl.* primitive that follows the _emit() pattern auto-supports both modes.
- The thread-local active-ctx pattern (D4) is justified at ADR level, clarifying who owns the reset responsibility when multiple PE runners share a thread (KernelRunner.run's contract restores active inside _switch_kernel).
- tl.recv_no_consume's diagnostic isolation (D3.10) is hardened in ADR form: accidental production use is blocked by a separate command branch.
- The registry (D7) gets its own D-section, formalizing the name-collision and dynamic-re-registration semantics.