tl.composite: fused epilogue ops with per-op scope

Extend tl.composite() with an ordered epilogue list. Each op carries a
scope flag: output_tile (the default; runs once per (m,n) output tile,
before STORE), k_tile (runs on every K-tile, right after GEMM), or
kernel.

The plan generator slots MATH stages by scope. pe_math reuses pe_dma's
local-loop pattern, so chained epilogues (bias->relu) skip the port
hop. op_log captures per-stage params for telemetry. Topology gains a
gemm->math edge (snapshot test updated).

The API stays backward-compatible: `epilogue=` is opt-in.

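A rough sketch of how ops could be bucketed by scope before MATH stages
are slotted. Only the three scope names and the output_tile default come
from the description above; the helper `group_by_scope` and the constants
`VALID_SCOPES` and `DEFAULT_SCOPE` are hypothetical names used for
illustration and are not the plan generator's actual code.

```python
from collections import defaultdict

# Scope names are taken from the commit description; all other names
# in this sketch are illustrative assumptions.
VALID_SCOPES = ("output_tile", "k_tile", "kernel")
DEFAULT_SCOPE = "output_tile"


def group_by_scope(epilogue):
    """Bucket epilogue ops by scope, keeping list order inside each bucket."""
    groups = defaultdict(list)
    for op in epilogue or []:  # None or [] means no fused epilogue (opt-in)
        scope = op.get("scope", DEFAULT_SCOPE)
        if scope not in VALID_SCOPES:
            raise ValueError(f"unknown epilogue scope: {scope!r}")
        groups[scope].append(op)
    return dict(groups)
```

Keeping the original list order inside each bucket matters because the
epilogue list is ordered (bias must run before relu, for instance).
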
Example:

    h = tl.composite(
        op="gemm", a=a, b=b, out_ptr=int(out),
        epilogue=[
            {"op": "dequant", "scale": s_per_k, "scope": "k_tile"},
            {"op": "bias", "bias": bias_vec},
            {"op": "relu"},
            {"op": "scale", "factor": 0.5},
        ],
    )
    tl.wait(h)

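For intuition, a pure-Python model of what the example above might
compute. The per-op semantics here are assumptions, not taken from the
implementation: dequant is assumed to multiply each K-tile's partial
product by that tile's entry in s_per_k, bias is assumed to be added per
output column, and scale multiplies by factor. The function name
`reference_composite` and the `tile_k` parameter are hypothetical.

```python
def reference_composite(a, b, s_per_k, bias_vec, tile_k, factor=0.5):
    """Model gemm + [dequant (k_tile), bias, relu, scale] on nested lists."""
    m, k, n = len(a), len(b), len(b[0])
    acc = [[0.0] * n for _ in range(m)]
    for t, k0 in enumerate(range(0, k, tile_k)):
        k1 = min(k0 + tile_k, k)
        for i in range(m):
            for j in range(n):
                partial = sum(a[i][kk] * b[kk][j] for kk in range(k0, k1))
                # k_tile scope: assumed to scale this K-tile's partial sum
                acc[i][j] += partial * s_per_k[t]
    # output_tile scope: bias, relu, scale applied once, in list order
    return [
        [max(acc[i][j] + bias_vec[j], 0.0) * factor for j in range(n)]
        for i in range(m)
    ]
```

The k_tile op sits inside the K loop while the output_tile ops run once
after accumulation, which mirrors the two scopes used in the example.
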
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -84,6 +84,7 @@ cube:
fetch_store_to_gemm_mm: 0.0 # fetch → GEMM chaining (ADR-0021)
fetch_store_to_math_mm: 0.0 # fetch → MATH chaining (ADR-0021)
gemm_to_fetch_store_mm: 0.0 # GEMM → store chaining (ADR-0021)
gemm_to_math_mm: 0.0 # GEMM → MATH epilogue chaining (ADR-0021)
math_to_fetch_store_mm: 0.0 # MATH → store chaining (ADR-0021)
fetch_store_to_dma_mm: 0.0 # store → DMA writeback chaining (ADR-0021)
gemm_to_tcm_bw_gbs: 512.0