Autotune FlashInfer operations. FlashInfer has many implementations of the same operation; autotuning benchmarks each implementation and stores the results. The results are cached transparently, and future calls to FlashInfer will use the best implementation. Without autotuning, FlashInfer relies on heuristics, which may be significantly slower.
Source code in vllm/model_executor/warmup/kernel_warmup.py
def flashinfer_autotune(runner: "GPUModelRunner") -> None:
    """
    Autotune FlashInfer operations.
    FlashInfer has many implementations of the same operation;
    autotuning benchmarks each implementation and stores the
    results. The results are cached transparently, and future
    calls to FlashInfer will use the best implementation.
    Without autotuning, FlashInfer relies on heuristics, which
    may be significantly slower.
    """
    import vllm.utils.flashinfer as fi_utils

    with torch.inference_mode(), fi_utils.autotune():
        # Certain FlashInfer kernels (e.g. nvfp4 routed moe) are
        # incompatible with autotuning. This state is used to skip
        # those kernels during the autotuning process.
        fi_utils._is_fi_autotuning = True
        # We skip EPLB here since we don't want to record dummy metrics.
        # When autotuning with m tokens, FlashInfer autotunes operations
        # for all token counts up to m, so we only need to run with the
        # maximum number of tokens.
        runner._dummy_run(
            runner.scheduler_config.max_num_batched_tokens,
            skip_eplb=True,
            is_profile=True,
        )
        fi_utils._is_fi_autotuning = False